Abstract

The conditional distribution of the next outcome given the infinite past of a stationary
process can be inferred from finite but growing segments of the past. Several schemes are
known for constructing pointwise consistent estimates, but they all demand prohibitive
amounts of input data. In this paper we consider real-valued time series and construct
conditional distribution estimates that make much more efficient use of the input data.
The estimates are consistent in a weak sense, and the question whether they are pointwise
consistent is still open. For finite-alphabet processes one may rely on a universal data
compression scheme like the Lempel-Ziv algorithm to construct conditional probability
mass function estimates that are consistent in expected information divergence.
Consistency in this strong sense cannot be attained in a universal sense for all stationary
processes with values in an infinite alphabet, but weak consistency can. Some applications
of the estimates to on-line forecasting, regression and classification are discussed.
I. Introduction and Overview

We are motivated by some fundamental questions regarding inference of time series
that were raised by T. Cover [9] and concerning which significant progress has been made
during the intervening years. The time series is a stationary process $\{X_t\}$ with values in
a set $\mathcal{X}$ which may be a finite set, the real line, or a finite-dimensional Euclidean space.
For $t > 0$ let $X^t = (X_0, X_1, \ldots, X_{t-1})$ denote the $t$-past at time $t$. It is also convenient
to consider the outcome $X = X_0$, the $t$-past $X^{-t} = (X_{-t}, \ldots, X_{-1})$ and the infinite past
$X^- = (\ldots, X_{-2}, X_{-1})$ at time 0. The true process distribution $P$ is unknown a priori but
is known to fall in the class $\mathcal{P}_s$ of stationary distributions on the sequence space $\mathcal{X}^{\mathbb{Z}}$.

Cover's list of questions included the following: given that $\{X_t\}$ is a $\{0,1\}$-valued time
series with an unknown stationary ergodic distribution $P$, is it possible to infer estimates
$\hat{P}\{X_t = 1 | X^t\}$ of the conditional probabilities $P\{X_t = 1 | X^t\}$ from the past $X^t$ such that

$$[\hat{P}\{X_t = 1 | X^t\} - P\{X_t = 1 | X^t\}] \to 0 \quad P\text{-almost surely as } t \to \infty? \tag{1}$$

D. Bailey [5] used the cutting and stacking technique of ergodic theory to prove that the
answer is negative. A simple proof of this negative result is outlined in Proposition 3 of
Ryabco [30]. Bailey [5] also discussed a result of Ornstein [22] that provides a positive
answer to a less demanding question of Cover [9], namely whether there exist estimates
$\hat{P}\{X = 1 | X^{-t}\}$ based on the past $X^{-t}$ such that for all $P \in \mathcal{P}_s$,

$$\hat{P}\{X = 1 | X^{-t}\} \to P\{X = 1 | X^-\} \quad P\text{-almost surely as } t \to \infty. \tag{2}$$

Ornstein constructed estimates $\hat{P}_k\{X = 1 | X^{-\lambda(k)}\}$ which depend on finite past segments
$X^{-\lambda(k)} = (X_{-\lambda(k)}, \ldots, X_{-1})$ and which converge almost surely to $P\{X = 1 | X^-\}$ for every
$P \in \mathcal{P}_s$. The length $\lambda(k)$ of the data record $X^{-\lambda(k)}$ depends on the data itself, i.e. $\lambda(k)$
is a stopping time adapted to the filtration $\{\sigma(X^{-t}) : t > 0\}$. To get estimates satisfying
(2), simply define $\hat{P}\{X = 1 | X^{-t}\}$ as the estimate $\hat{P}_k\{X = 1 | X^{-\lambda(k)}\}$ where $k$ is the
largest integer such that $\hat{P}_k\{X = 1 | X^{-\lambda(k)}\}$ can be evaluated from the data (that is,
$X^{-\lambda(k)}$ is a suffix of the string $X^{-t}$ but $X^{-\lambda(k+1)}$ is not). The true conditional probability
$P\{X = 1 | X^{-t}\}$ converges to $P\{X = 1 | X^-\}$ almost surely by the martingale convergence
theorem and the estimate $\hat{P}\{X = 1 | X^{-t}\}$ converges to the same limit, hence

$$[\hat{P}\{X = 1 | X^{-t}\} - P\{X = 1 | X^{-t}\}] \to 0 \quad P\text{-almost surely and in } L^1(P). \tag{3}$$

An on-line estimate $\hat{P}\{X_t = 1 | X^t\}$ can be constructed at time $t$ from the past $X^t$ in the
same way as $\hat{P}\{X = 1 | X^{-t}\}$ was constructed from $X^{-t}$. By (3) and stationarity

$$[\hat{P}\{X_t = 1 | X^t\} - P\{X_t = 1 | X^t\}] \to 0 \quad \text{in } L^1(P) \text{ as } t \to \infty. \tag{4}$$

Thus the guessing scheme $\hat{P}\{X_t = 1 | X^t\}$ is universally consistent in the weak sense of (4),
although no guessing scheme can be universally consistent in the pointwise sense of (1).
Ornstein's result can be generalized when $\{X_t\}$ is a stationary process with values in a
complete separable metric (Polish) space $\mathcal{X}$. Algoet [1] constructed estimates $\hat{P}_k(dx | X^{-\lambda(k)})$
that, with probability one under any $P \in \mathcal{P}_s$, converge in law to the true conditional
distribution $P(dx | X^-)$ of $X = X_0$ given the infinite past. By setting $\hat{P}(dx | X^{-t}) = \hat{P}_k(dx | X^{-\lambda(k)})$
for $\lambda(k) \le t < \lambda(k+1)$, one obtains estimates $\hat{P}(dx | X^{-t})$ that almost surely converge in
law to the random measure $P(dx | X^-)$ in the space of probability distributions on $\mathcal{X}$. Thus
for any bounded continuous function $h(x)$ and any stationary distribution $P \in \mathcal{P}_s$,

$$\int h(x)\, \hat{P}(dx | X^{-t}) \to \int h(x)\, P(dx | X^-) \quad P\text{-almost surely}. \tag{5}$$

A much simpler estimate $\hat{P}_k(dx | X^{-\lambda(k)})$ and convergence proof were obtained by Morvai,
Yakowitz and Györfi [21]. Their estimate $\hat{P}_k\{X \in B | X^{-\lambda(k)}\}$ of the conditional probability
of a subset $B \subseteq \mathcal{X}$ has the structure of a sample mean:

$$\hat{P}_k\{X \in B | X^{-\lambda(k)}\} = \frac{1}{k} \sum_{1 \le i \le k} 1\{X_{-\tau(i)} \in B\}, \tag{6}$$

where the $X_{-\tau(i)}$ are samples of the process at selected instants in the past and $\lambda(k)$ is
the smallest integer $t$ such that the indices $\{\tau(i) : 1 \le i \le k\}$ can be inferred from the
segment $X^{-t}$. From careful reading of [21], one can surmise that $\lambda(k)$ will be huge for
relatively small values of the sample size $k$. Morvai [20] applied the ergodic theorem for
recurrence times of Ornstein and Weiss [24] and argued that if $\{X_t\}$ is a stationary ergodic
finite-alphabet process with positive entropy rate $H$ bits per symbol and $C$ is a constant
such that $1 < C < 2^H$, then, with probability one,

$$\lambda(k) \ge C^{C^{\cdot^{\cdot^{\cdot^{C}}}}} \quad \text{eventually for large } k, \tag{7}$$

where the height of the exponential tower is $k - k_0$ for some number $k_0$ that depends on
the process realization but not on $k$. To our knowledge, none of the strongly consistent
methods have been applied to any data sets, real or simulated.
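To get a feeling for the growth in (7), the following sketch evaluates an exponential tower of a given height. It is only an illustration: the function name `tower` is hypothetical, and the integer base 2 stands in for the constant $C$ (the bound itself allows any real $C$ with $1 < C < 2^H$, so base 2 presumes $H > 1$).

```python
def tower(base: int, height: int) -> int:
    """Return base ** (base ** (... ** base)) with `height` copies of base.

    A tower of height 0 is taken to be 1 (the empty power).
    """
    value = 1
    for _ in range(height):
        value = base ** value
    return value


# Heights 1 to 4 are still easy to print; height 5 already has 65537 bits.
for h in range(1, 5):
    print(h, tower(2, h))
print(5, "bit length:", tower(2, 5).bit_length())
```

Even with base 2, a tower of height 6 cannot be stored on any physical machine, which is the practical content of the remark that $\lambda(k)$ is huge for small $k$.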

Scarpellini [31] has applied the methods of Bailey [5] and Ornstein [22] to infer the
conditional expectation $E\{X_\tau | \{X_s\}_{s \le 0}\}$ of the outcome $X_\tau$ at some fixed time $\tau > 0$
given the infinite past of a stationary real-valued continuous-time process $\{X_t\}$ from past
experience. The outcomes $X_t$ are assumed to be bounded in absolute value by some fixed
constant $K$. Scarpellini constructs estimates by averaging samples taken at a finite number
of regularly spaced instants in the past and proves that the estimates converge almost surely
to the desired limit $E\{X_\tau | \{X_s\}_{s \le 0}\}$. His generalization of Ornstein's result is not quite
straightforward, and the difficulty seems to be caused more by the continuity of the range
space $[-K, K]$ than by the continuity of the time index $t$.

These works are of considerable theoretical interest because they point to the limits
of what can be achieved by way of time series prediction. Pointwise consistency can be
attained for all stationary processes, but the estimates are based on enormous data records.
It is hard to say how much raw data are really needed to get estimates with reasonable
precision. The nonparametric class of all stationary ergodic processes is very rich and can
model all sorts of complex nonlinear dynamics with long range dependencies and
periodicities at many different time scales. It is hopeless to get efficient estimates with bounds on
the convergence rate unless one has a priori information that winnows the range of
possibilities to some manageable subclass. In the literature on nonparametric estimation (e.g. see
Györfi, Härdle, Sarda and Vieu [15] and also Marton and Shields [19]), one imposes mixing
conditions on the time series and then finds that the standard methods are consistent and
achieve stated asymptotic rates of convergence. These approaches are preferable to the
universal methods when one is assured of the mixing hypotheses. On the other hand, there
is essentially no methodology for testing for mixing.

In the present study we relax the strong consistency requirement and push in the
direction of greater efficiency. Rather than demanding strong consistency or pointwise
convergence in (5), we shall be satisfied with weak consistency or mean convergence in $L^1(P)$.
(Note that mean convergence is equivalent to convergence in probability because the
random variables are uniformly bounded.) Being more tolerant in this way enables us to
significantly reduce the data demands of the algorithm. The estimates will again be
defined as empirical averages of sample values, but the length of the raw data segment that
must be inspected to collect a given number of samples will grow only polynomially fast
in the sample size (when $\mathcal{X}$ is a finite alphabet), rather than as a tower of exponentials in
(7).

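For processes with values in a finite set $\mathcal{X}$, weak consistency means that for any
stationary distribution $P$ on $\mathcal{X}^{\mathbb{Z}}$ and any $x \in \mathcal{X}$, the estimate $\hat{P}(x | X^{-t}) = \hat{P}\{X = x | X^{-t}\}$
will converge in mean to the true conditional probability $P(x | X^-) = P\{X = x | X^-\}$:

$$\hat{P}(x | X^{-t}) \to P(x | X^-) \quad \text{in } L^1(P), \text{ for any } x \in \mathcal{X}. \tag{8}$$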

There exist estimates that are universally consistent in a stronger sense. Given a universal
data compression algorithm or a universal parsimonious modeling scheme for stationary
processes with values in the finite alphabet $\mathcal{X}$, we shall design estimates $\hat{P}(x | X^{-t})$ that are
consistent in expected information divergence for all stationary $P$. The expectation of the
Kullback-Leibler divergence between the conditional probability mass function $P(x | X^-)$
and the estimate $\hat{P}(x | X^{-t})$ will vanish in the limit as $t \to \infty$ for all $P \in \mathcal{P}_s$:

$$E_P\{I(P_{X|X^-} \,|\, \hat{P}_{X|X^{-t}})\} \to 0, \tag{9}$$

where

$$I(P_{X|X^-} \,|\, \hat{P}_{X|X^{-t}}) = \sum_{x \in \mathcal{X}} P(x | X^-) \log \frac{P(x | X^-)}{\hat{P}(x | X^{-t})}. \tag{10}$$
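The divergence in (10) is straightforward to evaluate once the two probability mass functions are given. The sketch below is a minimal illustration, not part of the paper's construction; the function name `kl_divergence_bits` and the dictionary representation of a mass function are assumptions made for the example. Logarithms are in base 2, matching the paper's convention.

```python
import math


def kl_divergence_bits(p: dict, q: dict) -> float:
    """Return sum_x p(x) * log2(p(x) / q(x)), the divergence of q from p in bits.

    Terms with p(x) == 0 contribute nothing; the result is infinite when q
    assigns zero mass to a symbol that p considers possible.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log2(px / qx)
    return total


true_pmf = {0: 0.5, 1: 0.5}
estimate = {0: 0.25, 1: 0.75}
print(kl_divergence_bits(true_pmf, estimate))
```

The infinite value for an estimate that rules out a possible symbol shows why consistency in expected divergence is a stronger demand than mean convergence of the probabilities themselves.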
Consistency in expected information divergence implies consistency in mean as in (8), and
is equivalent to the requirement that for any stationary $P \in \mathcal{P}_s$ we have mean convergence

$$\log \hat{P}(X | X^{-t}) \to \log P(X | X^-) \quad \text{in } L^1(P). \tag{11}$$

The constructions of Ornstein [22] and Morvai, Yakowitz and Györfi [21] yield estimates
$\hat{P}(x | X^{-t})$ such that (11) holds universally in the pointwise sense, but perhaps not in mean.

No estimates $\hat{P}(x | X^{-t})$ can be consistent in expected information divergence for all
stationary processes with values in a countable infinite alphabet, but weak consistency as in
(8) is universally achievable. Barron, Györfi and van der Meulen [7] consider an unknown
distribution $P(dx)$ on an abstract measurable space $\mathcal{X}$ and construct estimates from
independent samples so that the estimates are consistent in information divergence and in
expected information divergence whenever $P(dx)$ has finite Kullback-Leibler divergence
$I(P|M) < \infty$ relative to some known probability distribution $M(dx)$ on $\mathcal{X}$. In the present
paper, the discussion of estimates that are consistent in expected information divergence
is limited to the finite-alphabet case.

The organization of the paper is as follows. In Section II we describe an algorithm
for constructing estimates $\hat{P}_k(dx | X^{-\lambda(k)})$ and prove weak consistency for all stationary
real-valued time series. The method and its proof apply to time series with values in
any $\sigma$-compact Polish space. In Section III we transform the estimates $\hat{P}_k(dx | X^{-\lambda(k)})$
into estimates $\hat{P}(dx | X^{-n})$ by letting $k$ depend on $n$. We choose an increasing sequence
$k(n)$ and define the estimate $\hat{P}(dx | X^{-n})$ as $\hat{P}_{k(n)}(dx | X^{-\lambda(k(n))})$ if $\lambda(k(n)) \le n$ and as
some default measure $Q(dx)$ otherwise. If $k(n)$ grows sufficiently slowly with $n$ then the
data requirement $\lambda(k(n))$ will seldom exceed the available length $n$ and the estimates
$\hat{P}(dx | X^{-n})$ will be weakly consistent just like the estimates $\hat{P}_{k(n)}(dx | X^{-\lambda(k(n))})$. Section
IV is about modeling and data compression and about estimates that are consistent in
expected information divergence for stationary processes with values in a finite alphabet.
In Section V, we shift $\hat{P}(dx | X^{-t})$ from time 0 to time $t$ and show that the shifted estimates
$\hat{P}(dx_t | X^t)$ can be used for sequential forecasting or on-line prediction. We show that one
can make sequential decisions based on the shifted estimates $\hat{P}(dx_t | X^t)$ so that the average
loss per decision converges in mean to the minimum long run average loss that could be
attained if one could make decisions with knowledge of the true conditional distribution
of the next outcome given the infinite past at each step. In particular, the average rate of
incorrect guesses in classification and the average of the mean squared error in regression
converge to the minimum that could be attained if the infinite past were known to begin
with.

We would like to alert the reader about some of our notational conventions. Only one
level of subscripts or superscripts is allowed in equations that are embedded in the text
and so we are often forced to adopt the flat functional notation $\lambda(k)$, $\lambda(k(n))$, $\ell(k)$, $J(k)$,
$\tau(k,j)$, etc. However, the equations sometimes look better with nested subscripts and
superscripts and therefore we prefer to write $\lambda_k$, $\lambda_{k(n)}$, $\ell_k$, $J_k$, $\tau^k_j$, etc. in the displayed
equations. We hope that mixing of these notational conventions will not be a source of
confusion but rather will improve the readability of the paper. Logarithms and entropy
rates are taken in base 2 unless specified otherwise, and exponential growth rates are really
doubling rates.

II. Learning the Conditional Distribution $P(dx | X^-)$

Let $\{X_t\}$ be a real-valued stationary time series. The process distribution is unknown
but shift-invariant. We wish to infer the conditional distribution of $X = X_0$ given the
infinite past $X^-$ from past experience. We show that it is very easy to construct weakly
consistent estimates $\hat{P}_k(dx | X^{-\lambda(k)})$ depending on finite past data segments $X^{-\lambda(k)}$ such that
for every bounded continuous function $h(x)$ on $\mathcal{X}$ and any stationary distribution $P \in \mathcal{P}_s$,

$$\lim_k \int h(x)\, \hat{P}_k(dx | X^{-\lambda(k)}) = \int h(x)\, P(dx | X^-) \quad \text{in } L^1(P). \tag{12}$$

The estimates $\hat{P}_k(dx | X^{-\lambda(k)})$ will be defined in terms of quantized versions of the process
$\{X_t\}$. Let $\mathcal{X}$ denote the real line and let $\{\mathcal{B}_k\}_{k \ge 1}$ be an increasing sequence of finite subfields
that asymptotically generate the Borel $\sigma$-field on $\mathcal{X}$. Let $x \mapsto [x]^k$ denote the quantizer that
maps any point $x \in \mathcal{X}$ to the atom of $\mathcal{B}_k$ that happens to contain $x$. For any integer $\ell \ge 1$
let $[X^{-\ell}]^k$ denote the quantized sequence $([X_{-\ell}]^k, \ldots, [X_{-1}]^k)$. Given any integer $J \ge 1$,
one may search backwards in time and collect $J$ samples of the process at times when
the quantized $\ell$-past looks exactly like the quantized $\ell$-past at time 0. Let $\lambda = \lambda(k, \ell, J)$
denote the length of the data segment $X^{-\lambda} = (X_{-\lambda}, \ldots, X_{-1})$ that must be inspected to find
these $J$ samples and let $\hat{P}_{k,\ell,J}(dx | X^{-\lambda})$ denote the empirical distribution of those samples.
Then $\hat{P}_{k,\ell,J}(dx | X^{-\lambda})$ will be a good estimate of $P(dx | X^-)$ if the sample size $J$, the context
length $\ell$ and the quantizer index $k$ are sufficiently large. In fact, if $k$ and $\ell$ are fixed and
the sample size $J$ tends to infinity then by the ergodic theorem, $\hat{P}_{k,\ell,J}(dx | X^{-\lambda(k,\ell,J)})$ will
converge in law to $P(dx | [X^{-\ell}]^k)$. If we now refine the context by increasing $k$ and $\ell$, then
$P(dx | [X^{-\ell}]^k)$ will converge in law to $P(dx | X^-)$ by the martingale convergence theorem.
The question is how to turn this limit of limits into a single limit by letting $k$, $\ell$ and $J$
increase simultaneously to infinity. We must make $k$ and $\ell$ large to reduce the bias and we
must make $J$ large to reduce the variance of the estimates. We will let $\ell$ and $J$ grow with $k$
and show that if $\ell(k)$ and $J(k)$ are monotonically increasing to infinity then the empirical
conditional distribution estimate $\hat{P}_k(dx | X^{-\lambda(k)}) = \hat{P}_{k,\ell(k),J(k)}(dx | X^{-\lambda(k,\ell(k),J(k))})$ converges
weakly to $P(dx | X^-)$. After this brief outline we now proceed with a detailed development.
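One concrete choice of the quantizer $x \mapsto [x]^k$ may help fix ideas. The sketch below assumes, purely for illustration, that the field $\mathcal{B}_k$ is generated by the dyadic intervals of width $2^{-k}$ inside $[-k, k)$ together with the two tails $(-\infty, -k)$ and $[k, \infty)$; the paper only requires an increasing sequence of finite fields generating the Borel $\sigma$-field, so this particular partition and the name `quantize` are assumptions.

```python
import math


def quantize(x: float, k: int) -> int:
    """Return an integer label for the atom of the (assumed) field B_k containing x.

    Atoms: the two tails (-inf, -k) and [k, inf), plus the cells
    [j / 2**k, (j + 1) / 2**k) for j = -k * 2**k, ..., k * 2**k - 1.
    """
    cells_per_unit = 2 ** k
    if x < -k:
        return -k * cells_per_unit - 1   # label reserved for the left tail
    if x >= k:
        return k * cells_per_unit        # label reserved for the right tail
    return math.floor(x * cells_per_unit)


# The same point falls into finer and finer cells as k grows.
for k in (1, 2, 3):
    print(k, quantize(0.3, k))
```

With this choice every atom of $\mathcal{B}_k$ is a union of atoms of $\mathcal{B}_{k+1}$, so the fields are nested, and the cells shrink while the covered range grows, which is what asymptotic generation of the Borel sets requires.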

Let $\{\ell_k\}_{k \ge 1}$ and $\{J_k\}_{k \ge 1}$ be two nondecreasing unbounded sequences of positive integers.
We often write $\ell(k)$ and $J(k)$ instead of $\ell_k$ and $J_k$. For fixed $k \ge 1$ let $\{-\tau^k_j\}_{j \ge 0}$ and $\{\hat{\tau}^k_j\}_{j \ge 0}$
denote the sequences of past and future recurrence times of the pattern $[X^{-\ell(k)}]^k$. Thus we
set $\tau^k_0 = \hat{\tau}^k_0 = 0$ and for $j = 1, 2, \ldots$ we inductively define

$$\tau^k_j = \min\{t > \tau^k_{j-1} : ([X_{-\ell_k - t}]^k, \ldots, [X_{-1-t}]^k) = ([X_{-\ell_k}]^k, \ldots, [X_{-1}]^k)\}, \tag{13}$$

$$\hat{\tau}^k_j = \min\{t > \hat{\tau}^k_{j-1} : ([X_{-\ell_k}]^k, \ldots, [X_{-1}]^k) = ([X_{-\ell_k + t}]^k, \ldots, [X_{-1+t}]^k)\}. \tag{14}$$

The random variables $\tau(k,j) = \tau^k_j$ and $\hat{\tau}(k,j) = \hat{\tau}^k_j$ are finite almost surely by Poincaré's
recurrence theorem for the quantized process $\{[X_t]^k\}$, cf. Theorem 6.4.1 of Gray [14]. The
lengths $\lambda_k = \lambda(k)$ and estimates $\hat{P}_k(dx | X^{-\lambda_k})$ are now defined by the formulas

$$\lambda_k = \lambda(k) = \ell(k) + \tau(k, J_k), \tag{15}$$

$$\hat{P}_k(dx | X^{-\lambda_k}) = \frac{1}{J_k} \sum_{1 \le j \le J_k} \delta_{X_{-\tau(k,j)}}(dx), \tag{16}$$

where $\delta_\xi(dx)$ is the Dirac measure that places unit mass at the point $\xi \in \mathcal{X}$. Thus for any
Borel set $B$, the conditional probability estimate

$$\hat{P}_k\{X \in B | X^{-\lambda_k}\} = \frac{1}{J_k} \sum_{1 \le j \le J_k} 1\{X_{-\tau(k,j)} \in B\} \tag{17}$$

is obtained by searching for the $J_k$ most recent occurrences of the pattern $[X^{-\ell(k)}]^k$ and
calculating the relative frequency with which the next realized symbols $X_{-\tau(k,j)}$ hit the set
$B$. We shall prove that $\hat{P}_k(dx | X^{-\lambda(k)})$ is a weakly consistent estimate of $P(dx | X^-)$. The
precise statement and the proof are broken down into two parts.
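The backward search behind (15) to (17) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the past is stored as a Python list whose last element is $X_{-1}$, the quantizer is passed in as a function (identity by default), and the names `recurrence_samples` and `estimate_probability` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def recurrence_samples(
    past: Sequence,
    ell: int,
    num_samples: int,
    quantizer: Callable = lambda x: x,
) -> Optional[Tuple[List, int]]:
    """Collect the symbols that followed the most recent recurrences of the
    quantized ell-past, searching backwards in time.

    Returns (samples, search_depth) with search_depth = ell + tau_J as in (15),
    or None if the available past is too short to supply num_samples samples.
    """
    q = [quantizer(x) for x in past]
    n = len(q)
    pattern = q[n - ell:]                    # quantized (X_{-ell}, ..., X_{-1})
    samples: List = []
    t = 0
    while len(samples) < num_samples:
        t += 1
        if n - ell - t < 0:
            return None                      # ran out of data
        if q[n - ell - t:n - t] == pattern:  # block shifted t steps back
            samples.append(past[n - t])      # unquantized successor X_{-t}
    return samples, ell + t


def estimate_probability(samples: Sequence, in_set: Callable) -> float:
    """Relative frequency of samples landing in the set B, as in (17)."""
    return sum(1 for x in samples if in_set(x)) / len(samples)


past = [0, 1, 0, 1, 1, 0, 1]                 # last entry plays the role of X_{-1}
result = recurrence_samples(past, ell=2, num_samples=2)
if result is not None:
    samples, depth = result
    print(samples, depth, estimate_probability(samples, lambda x: x == 1))
```

Note that the pattern is matched on quantized values while the collected samples are the original, unquantized successors, which is what allows the empirical measure in (16) to be a distribution on the real line.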

Theorem 1A. For any set $B$ in the generating field $\bigcup_k \mathcal{B}_k$ and any stationary process
distribution $P \in \mathcal{P}_s$ we have mean convergence

$$\lim_k \hat{P}_k\{X \in B | X^{-\lambda_k}\} = P\{X \in B | X^-\} \quad \text{in } L^1(P). \tag{18}$$

The proof is somewhat technical and is placed in the Appendix. In the second part we
argue that the estimators $\hat{P}_k(dx | X^{-\lambda(k)})$ can be employed to infer the regression function
$E\{h(X) | X^-\} = \int h(x)\, P(dx | X^-)$ of any bounded continuous function $h(x)$ given the past.

Theorem 1B. Let $\{X_t\}$ be a real-valued stationary time series. If the fields $\mathcal{B}_k$ are
generated by intervals and the estimator $\hat{P}_k(dx | X^{-\lambda(k)})$ is defined as in (16) then for any bounded
continuous function $h(x)$ on $\mathcal{X}$,

$$\lim_k \int h(x)\, \hat{P}_k(dx | X^{-\lambda_k}) = \int h(x)\, P(dx | X^-) \quad \text{in } L^1(P). \tag{19}$$
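Proof: Pick some bound $M$ such that $|h(x)| \le M$ on $\mathcal{X}$. Given $\epsilon > 0$ there exists an integer
$\kappa$ and a finite interval $K$ in the field $\mathcal{B}_\kappa$ such that

$$P\{X \in K\} > 1 - \frac{\epsilon}{M}. \tag{20}$$

If necessary we increase $\kappa$ until $\kappa$ is sufficiently large so that there exists a $\mathcal{B}_\kappa$-measurable
function $g(x)$ such that $|h(x) - g(x)| < \epsilon$ on $K$. Assuming $g(x) = 0$ outside $K$, we have

$$|h(x) - g(x)| \le f(x) = \epsilon\, 1\{x \in K\} + M\, 1\{x \notin K\}. \tag{21}$$

Let $\hat{P}_k$ and $P^-$ be shorthand for $\hat{P}_k(dx | X^{-\lambda(k)})$ and $P(dx | X^-)$. Then

$$\left| \int h\, d\hat{P}_k - \int h\, dP^- \right| \le \int |h - g|\, d\hat{P}_k + \left| \int g\, d\hat{P}_k - \int g\, dP^- \right| + \int |g - h|\, dP^-. \tag{22}$$

The function $g(x)$ is a finite linear combination of indicator functions of $\mathcal{B}_\kappa$-measurable
subsets, and Theorem 1A implies that $\int g\, d\hat{P}_k$ converges to $\int g\, dP^-$ in $L^1$:

$$E \left| \int g\, d\hat{P}_k - \int g\, dP^- \right| \to 0. \tag{23}$$

The function $f(x)$ is measurable and bounded, hence $\int f\, d\hat{P}_k$ converges to $\int f\, dP^-$ in $L^1$
and the expectations converge:

$$E \int f\, d\hat{P}_k \to E \int f\, dP^- = E f. \tag{24}$$

Since $|h - g| \le f$ and $E f \le \epsilon P\{X \in K\} + M P\{X \notin K\} \le 2\epsilon$ by (20) and (21), it follows
from (22), (23) and (24) that

$$E \left| \int h\, d\hat{P}_k - \int h\, dP^- \right| \le 2\epsilon + \epsilon + 2\epsilon \quad \text{eventually for large } k. \tag{25}$$

Thus $E | \int h\, d\hat{P}_k - \int h\, dP^- | \to 0$, and this is the desired conclusion (19). ∎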

Theorem 1B holds in general if $\mathcal{X}$ is a $\sigma$-compact Polish space and the fields $\mathcal{B}_k$ are
suitably chosen. Indeed, let $\{K_k\}_{k \ge 1}$ be an increasing sequence of compact subsets with
union $\bigcup_k K_k = \mathcal{X}$. For any fixed $k$ one may cover $K_k$ with a finite collection of open balls
having diameter less than $\epsilon_k$, where $\epsilon_k \searrow 0$ as $k \to \infty$. Let $\mathcal{B}_k$ denote the smallest field
containing $\mathcal{B}_{k-1}$ and the sets $B \cap K_k$ where $B$ ranges over all balls in the finite cover of $K_k$.
(We start with the trivial field $\mathcal{B}_0 = \{\emptyset, \mathcal{X}\}$.) Any bounded continuous function $h(x)$ on $\mathcal{X}$
is uniformly continuous on each compact subset of $\mathcal{X}$. If $|h(x)| \le M$ and $\epsilon > 0$, then for
sufficiently large $\kappa$ there exists some compact subset $K$ in $\mathcal{B}_\kappa$ such that $P\{X \notin K\} < \epsilon/M$
and $h(x)$ oscillates less than $\epsilon$ on each atom of $\mathcal{B}_\kappa$ that is contained in $K$. Thus there exists
a $\mathcal{B}_\kappa$-measurable function $g(x)$ such that $|h(x) - g(x)| < \epsilon$ on $K$ and $g(x) = 0$ outside
$K$. We can then proceed as in the proof of Theorem 1B to prove that for any bounded
continuous function $h(x)$,

$$\int h(x)\, \hat{P}_k(dx | X^{-\lambda(k)}) \to E\{h(X) | X^-\} \quad \text{in } L^1. \tag{26}$$
III. Truncation of the Search Depth

The estimates $\hat{P}_k(dx | X^{-\lambda(k)})$ are based on finite but random length segments of the past.
We shall transform these into estimates $\hat{P}(dx | X^{-n})$ that depend on finite past segments
with deterministic length but that still are weakly consistent. The details are somewhat
more involved than for the strongly consistent estimates in Section I. In terms of the
empirical conditional distribution $\hat{P}_{k,\ell,J}(dx | X^{-\lambda(k,\ell,J)})$ that was defined in the outline of Section II,
the question is how fast $k$, $\ell$ and $J$ may increase with $n$ so that $\lambda(k(n), \ell(n), J(n)) \le n$ with
high probability. The weak consistency of the estimates $\hat{P}_{k(n),\ell(n),J(n)}(dx | X^{-\lambda(k(n),\ell(n),J(n))})$
will not suffer if we redefine the estimates by assigning some default measure $Q(dx)$ in those
rare cases when the search depth $\lambda(k(n), \ell(n), J(n))$ exceeds the available record length $n$.
It is difficult to say what the optimal growth path is for $k(n)$, $\ell(n)$ and $J(n)$ without prior
information about the spatial and temporal dependency structure of the process.

The special case of finite alphabet processes is most interesting and it is simpler because
only 2 of the 3 parameters $k$, $\ell$, $J$ play a role. We do not need an index for subfields of $\mathcal{X}$
because the obvious choice for $\mathcal{B}_k$ is the field of all subsets of $\mathcal{X}$. Also, it is convenient to
choose the block length $\ell_k$ equal to $k$ so that $\tau^k_j$ is the time for $j$ recurrences of $X^{-k}$.

In Section A we recall the ergodic theorem for recurrence times that was derived by
Wyner and Ziv [34] and by Ornstein and Weiss [24] for finite alphabet processes. In
Section B we define conditional probability mass function estimates $\hat{P}(x | X^{-n})$ and we prove
consistency in mean if the block length $k(n)$ and the sample size $J_{k(n)}$ grow deterministically
and sufficiently slowly with $n$. In Section C we discuss generalizations for real-valued
processes.

A. Recurrence Times 

Let $\{X_t\}$ be a stationary ergodic process with values in a finite set $\mathcal{X}$. Starting at time
$\tau^k_0 = 0$, the successive recurrence times $\tau^k_j$ of the $k$-block $X^{-k}$ are defined as follows:

$$\tau^k_j = \inf\{t > \tau^k_{j-1} : (X_{-k-t}, \ldots, X_{-1-t}) = (X_{-k}, \ldots, X_{-1})\}. \tag{27}$$

If $P\{X^{-k} = x^{-k}\} > 0$ then by the results of Kac [17] (see also Willems [33], Wyner and
Ziv [34]),

$$E\{\tau^k_1 | X^{-k} = x^{-k}\} = \frac{1}{P\{X^{-k} = x^{-k}\}}. \tag{28}$$

Let $H$ denote the entropy rate of the stationary ergodic process $\{X_t\}$ in bits per symbol:

$$H = \lim_k -\frac{1}{k} E\{\log P(X^{-k})\} = \lim_k -E\{\log P(X | X^{-k})\}. \tag{29}$$

Wyner and Ziv [34], Theorem 3, invoked Kac's result and the Shannon-McMillan-Breiman
theorem to prove that $\tau^k_1$ cannot grow faster than exponentially with limiting rate $H$
($\limsup_k k^{-1} \log \tau^k_1 \le H$ almost surely). Ornstein and Weiss [24] then argued that $\tau^k_1$ will
grow exponentially fast almost surely with limiting rate exactly equal to $H$:

$$k^{-1} \log \tau^k_1 \to H \quad \text{almost surely}. \tag{30}$$
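The first recurrence time in (27) is easy to compute on a finite record, and the ratio $k^{-1} \log \tau^k_1$ from (30) can then be read off as a crude empirical growth rate. The sketch below is illustrative only; the function name `first_recurrence_time` is hypothetical, and the past is assumed to be stored as a string or list whose last element is $X_{-1}$.

```python
import math
from typing import Optional, Sequence


def first_recurrence_time(past: Sequence, k: int) -> Optional[int]:
    """Return the smallest t >= 1 such that the k-block ending t steps earlier
    equals the most recent k-block, or None if it never recurs in `past`."""
    n = len(past)
    pattern = past[n - k:]
    for t in range(1, n - k + 1):
        if past[n - k - t:n - t] == pattern:
            return t
    return None


past = "abcabcab"                       # last character plays the role of X_{-1}
k = 2
t = first_recurrence_time(past, k)
if t is not None:
    # log2(t) / k is the finite-sample analogue of the quantity in (30).
    print(t, math.log2(t) / k)
```

On a record this short the ratio says nothing about the entropy rate; the convergence in (30) is an asymptotic statement as the block length $k$ grows along an infinite past.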

Now suppose a sample of size $J_k$ is desired. The total time needed to find $J_k = J(k) \ge 1$
instances of the pattern $X^{-k}$ is equal to the recurrence time $\tau^k_{J(k)}$. The ratio $\tau^k_{J(k)} / J_k$ can
be interpreted as the average inter-recurrence time:

$$\frac{\tau^k_{J_k}}{J_k} = \frac{1}{J_k} \sum_{1 \le j \le J_k} (\tau^k_j - \tau^k_{j-1}). \tag{31}$$

We claim that like $\tau^k_1$, the average inter-recurrence time $\tau^k_{J(k)} / J_k$ cannot grow faster than
exponentially with limiting rate $H$. The proof is based on Kac's result and the lemma that
was developed by Algoet and Cover [3] to give a simple proof of the Shannon-McMillan-Breiman
theorem and a more general ergodic theorem for the maximum exponential growth
rate of compounded capital invested in a stationary market.

Theorem 2. Let $\{X_t\}$ be a stationary ergodic process with values in a finite set $\mathcal{X}$ and
with entropy rate $H$ bits per symbol. If $\Delta_k = \Delta(k)$ is a sequence of numbers such that
$\sum_k 2^{-\Delta(k)} < \infty$, then for arbitrary $J(k) = J_k > 0$ we have

$$\log\left(\frac{\tau^k_{J(k)}}{J_k}\right) \le -\log P(X^{-k}) + \Delta_k \quad \text{eventually for large } k, \tag{32}$$

and consequently

$$\limsup_k \frac{1}{k} \log\left(\frac{\tau^k_{J(k)}}{J_k}\right) \le H \quad \text{almost surely}. \tag{33}$$

Proof: The inter-recurrence times $\tau^k_j - \tau^k_{j-1}$ are identically distributed with the same
conditional distribution given $X^{-k}$ as the first recurrence time $\tau^k_1$. By Kac's result,

$$E\{\tau^k_{J(k)} | X^{-k}\}\, P(X^{-k}) = J_k\, E\{\tau^k_1 | X^{-k}\}\, P(X^{-k}) = J_k. \tag{34}$$

(A referee pointed out that a result like this was also proved by Gavish and Lempel [13].)

Thus the random variable $Z_k = P(X^{-k})\, \tau^k_{J(k)} / J_k$ has expectation

$$E\{Z_k\} = E\left\{ P(X^{-k})\, E\left\{ \frac{\tau^k_{J(k)}}{J_k} \,\Big|\, X^{-k} \right\} \right\} = 1. \tag{35}$$

By the Markov inequality,

$$P\{\log Z_k > \Delta_k\} = P\{Z_k > 2^{\Delta_k}\} \le 2^{-\Delta_k} E\{Z_k\} = 2^{-\Delta_k}, \tag{36}$$

and by the Borel-Cantelli lemma $\log Z_k \le \Delta_k$ eventually for large $k$. This proves (32).
Assertion (33) follows from (32) upon dividing both sides by $k$ and taking the limsup as
$k \to \infty$. Indeed, $-k^{-1} \log P(X^{-k}) \to H$ almost surely by the Shannon-McMillan-Breiman
theorem and one may choose $\Delta_k = 2 \log k$ so that $\Delta_k / k \to 0$. ∎
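The last step of the proof relies on the choice $\Delta_k = 2 \log k$ meeting both requirements at once. Since logarithms are taken in base 2, this can be checked directly:

```latex
\[
  \sum_{k \ge 1} 2^{-\Delta_k}
  = \sum_{k \ge 1} 2^{-2 \log_2 k}
  = \sum_{k \ge 1} \frac{1}{k^2}
  = \frac{\pi^2}{6} < \infty,
  \qquad
  \frac{\Delta_k}{k} = \frac{2 \log_2 k}{k} \to 0 .
\]
```

Summability is what the Borel-Cantelli step needs, and $\Delta_k / k \to 0$ is what makes the slack term disappear after dividing (32) by $k$.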

It is worthwhile to observe that Theorem 2 can be generalized if the process $\{X_t\}$ is
stationary but not necessarily ergodic. Let $P$ be a stationary distribution and let $P_\omega$ denote
the ergodic mode of the actual process realization $\omega$. Then by the ergodic decomposition
theorem (see Theorem 7.4.1 of Gray [14]) and the monotone convergence theorem,

$$\begin{aligned}
P\{X^{-k} = x^{-k}\}\, E\{\tau^k_{J(k)} | X^{-k} = x^{-k}\}
&= \sum_{1 \le t < \infty} t\, P\{X^{-k} = x^{-k}, \tau^k_{J(k)} = t\} \\
&= \sum_{1 \le t < \infty} \int t\, P_\omega\{X^{-k} = x^{-k}, \tau^k_{J(k)} = t\}\, P(d\omega) \\
&= \int \sum_{1 \le t < \infty} t\, P_\omega\{X^{-k} = x^{-k}, \tau^k_{J(k)} = t\}\, P(d\omega) \\
&= \int P_\omega\{X^{-k} = x^{-k}\}\, E_\omega\{\tau^k_{J(k)} | X^{-k} = x^{-k}\}\, P(d\omega) \\
&= J_k.
\end{aligned} \tag{37}$$

It follows that $E\{P(X^{-k})\, \tau^k_{J(k)}\} = J_k$ and

$$\log(\tau^k_{J(k)} / J_k) \le -\log P(X^{-k}) + \Delta_k \quad \text{eventually for large } k. \tag{38}$$

The Shannon-McMillan-Breiman theorem for stationary nonergodic processes asserts that
$P(X^{-k})$ decreases exponentially fast with limiting rate $H(P_\omega)$, so one may conclude that

$$\limsup_k \frac{1}{k} \log\left(\frac{\tau^k_{J(k)}}{J_k}\right) \le H(P_\omega) \quad \text{almost surely}. \tag{39}$$

Thus the average inter-recurrence time $\tau^k_{J(k)} / J_k$ cannot grow faster than exponentially with
limiting rate $H(P_\omega)$, the entropy rate of the ergodic mode $P_\omega$.

B. Conditional Probability Mass Function Estimates 

In the finite alphabet case, the general estimator $\hat{P}_k(dx | X^{-\lambda(k)})$ that was defined in (16)
reduces to the conditional probability mass function estimate

$$\hat{P}_k(x | X^{-\lambda_k}) = \frac{1}{J_k} \sum_{1 \le j \le J_k} 1\{X_{-\tau(k,j)} = x\}. \tag{40}$$

Here $k = \ell_k$ is the block length and the sample size $J_k$ is monotonically increasing. The
recurrence times $\tau^k_j$ of the $k$-block $X^{-k}$ were defined inductively for $j = 1, 2, 3, \ldots$ in (27).

We choose a slowly increasing sequence of block lengths $k(n)$ and set $\hat{P}(x | X^{-n})$ equal to
$\hat{P}_{k(n)}(x | X^{-\lambda(k(n))})$ if this estimate can be computed from the available data segment $X^{-n}$.
Otherwise, if $\lambda_{k(n)} > n$, we truncate the search and define $\hat{P}(x | X^{-n})$ as the default measure
$Q(x) = |\mathcal{X}|^{-1}$. Thus for $n > 0$, we define

$$\hat{P}(x | X^{-n}) = \begin{cases} \hat{P}_{k(n)}(x | X^{-\lambda(k(n))}) & \text{if } \lambda(k(n)) \le n, \\ Q(x) & \text{otherwise.} \end{cases} \tag{41}$$

If $k(n)$ grows sufficiently slowly then truncation is a rare event and $\hat{P}(x | X^{-n})$ coincides
most of the time with the weakly consistent estimator $\hat{P}_{k(n)}(x | X^{-\lambda(k(n))})$. The question
is how fast the block length $k(n)$ and the sample size $J_{k(n)}$ may grow to get consistent
estimates. To answer this question, we use our results about recurrence times.
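The truncated estimate in (41) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `truncated_pmf_estimate` is hypothetical, the past is a list whose last element is $X_{-1}$, the alphabet is assumed to have at least two symbols, and the particular schedule $k(n) = \lfloor (1 - \epsilon) \log_{|\mathcal{X}|} n \rfloor$, $J = \lfloor n^\epsilon \rfloor$ is one admissible choice assumed for the example.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def truncated_pmf_estimate(past: Sequence, alphabet: Sequence, eps: float = 0.5) -> Dict:
    """Estimate the conditional pmf of the next symbol from `past`.

    Searches backwards for J recurrences of the last k symbols and returns the
    relative frequencies of their successors; falls back to the uniform default
    measure when the search would run past the start of the record.
    """
    n = len(past)
    size = len(alphabet)                       # assumed to be at least 2
    default = {x: 1.0 / size for x in alphabet}
    if n < 2:
        return default
    k = int((1 - eps) * math.log(n) / math.log(size))   # block length k(n)
    num_samples = int(n ** eps)                          # sample size J
    if k < 1 or num_samples < 1:
        return default
    pattern = list(past[n - k:])
    counts = Counter()
    found = 0
    t = 0
    while found < num_samples:
        t += 1
        if n - k - t < 0:
            return default                     # search depth exceeds n: truncate
        if list(past[n - k - t:n - t]) == pattern:
            counts[past[n - t]] += 1
            found += 1
    return {x: counts[x] / num_samples for x in alphabet}


alternating = [0, 1] * 10                      # 20 symbols ending in ..., 0, 1
print(truncated_pmf_estimate(alternating, alphabet=[0, 1]))
```

For the alternating record the block `[0, 1]` has always been followed by `0`, so all the mass goes to that symbol; for a record in which the final block never recurred, the function returns the uniform default instead.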

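The inter-recurrence times $\tau^k_j - \tau^k_{j-1}$ have the same conditional distribution and hence
the same conditional expectation given $X^{-k}$ as the first recurrence time $\tau^k_1$. The expected
inter-recurrence time is bounded as follows:

$$E\left\{\frac{\tau^k_{J_k}}{J_k}\right\} = E\{\tau^k_1\} = \sum_{x^{-k} :\, P\{X^{-k} = x^{-k}\} > 0} P\{X^{-k} = x^{-k}\}\, E\{\tau^k_1 | X^{-k} = x^{-k}\} \le |\mathcal{X}|^k. \tag{42}$$

If $\epsilon_k > 0$ then by the Markov inequality

$$P\left\{\frac{\tau^k_{J_k}}{J_k} > \frac{|\mathcal{X}|^k}{\epsilon_k}\right\} \le \epsilon_k. \tag{43}$$

If $\epsilon_k \to 0$ then $P\{\tau^k_{J(k)} > J_k |\mathcal{X}|^k / \epsilon_k\} \to 0$ and if $\sum_k \epsilon_k < \infty$ then $\tau^k_{J(k)} \le J_k |\mathcal{X}|^k / \epsilon_k$
eventually for large $k$ by the Borel-Cantelli lemma. This is similar to (32) with $\epsilon_k = 2^{-\Delta(k)}$.
Since $\lambda(k) = k + \tau^k_{J(k)}$, we see that

$$P\{\lambda(k(n)) \le n\} \to 1 \quad \text{as } n \to \infty \tag{44}$$

if $J_k$ and $k(n)$ are chosen so that for some $\epsilon_k > 0$ with $\epsilon_k \to 0$,

$$k(n) + J_{k(n)} |\mathcal{X}|^{k(n)} / \epsilon_{k(n)} \le n \quad \text{eventually for large } n. \tag{45}$$

It suffices that $k(n) = (1 - \epsilon) \log_{|\mathcal{X}|} n$ for some $0 < \epsilon < 1$ and $J_k = o(|\mathcal{X}|^{k\epsilon/(1-\epsilon)})$ so that
$J_{k(n)} = o(n^\epsilon)$. (Noninteger values are rounded down to the nearest integer, as usual.) We
can be slightly more aggressive.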

Theorem 3. Let $\{X_t\}$ be a stationary process with values in a finite set $\mathcal{X}$ and choose
$Q(x) = |\mathcal{X}|^{-1}$ as default measure in (41). If the block length $k(n)$ and the sample size $J_{k(n)}$
are monotonically increasing to infinity and satisfy

$$J_{k(n)} |\mathcal{X}|^{k(n)} = O(n), \tag{46}$$

then the estimates $\hat{P}(x | X^{-n})$ in (41) are consistent in mean:

$$\hat{P}(x | X^{-n}) \to P(x | X^-) \quad \text{in } L^1(P). \tag{47}$$

In particular, the estimates $\hat{P}(x | X^{-n})$ are consistent in mean if the block length is
$k(n) = (1 - \epsilon) \log_{|\mathcal{X}|} n$ and the sample size is $J_{k(n)} = n^\epsilon$ for some $0 < \epsilon < 1$.
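The particular schedule named at the end of Theorem 3 can be tabulated to see how slowly the block length grows relative to the sample size. The sketch below is only an illustration; the function name `schedule` is hypothetical, and rounding down follows the paper's stated convention for noninteger values.

```python
import math
from typing import Tuple


def schedule(n: int, alphabet_size: int, eps: float) -> Tuple[int, int]:
    """Return (k, J) with k = floor((1 - eps) * log_{|X|} n) and J = floor(n ** eps)."""
    k = math.floor((1 - eps) * math.log(n) / math.log(alphabet_size))
    num_samples = math.floor(n ** eps)
    return k, num_samples


# The product J * |X|**k stays below n, which is condition (46) with constant 1.
for n in (100, 2000, 50000):
    k, num_samples = schedule(n, alphabet_size=2, eps=0.5)
    print(n, k, num_samples, num_samples * 2 ** k <= n)
```

Rounding down can only shrink both factors, so the bound $J_{k(n)} |\mathcal{X}|^{k(n)} \le n$ survives the rounding; the block length grows logarithmically in $n$ while the sample size grows like a power of $n$.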

Proof: If the entropy rate $H$ is strictly less than $\log |\mathcal{X}|$ and $R$ is any constant such that
$H < R < \log |\mathcal{X}|$ then by (33), $\tau^k_{J(k)}$ is asymptotically bounded by $J_k 2^{Rk}$. It follows that

$$\tau^{k(n)}_{J(k(n))} \le J_{k(n)} 2^{R k(n)} = J_{k(n)} |\mathcal{X}|^{k(n)}\, 2^{(R - \log |\mathcal{X}|) k(n)} = o(n). \tag{48}$$

It is necessary for (46) that $k(n) \le \log_{|\mathcal{X}|} n$ eventually for large $n$ since $J_{k(n)} \to \infty$ by
assumption. Thus $\lambda(k(n)) = k(n) + \tau^{k(n)}_{J(k(n))} = o(n)$ and $\lambda(k(n))$ is upper bounded by $n$
eventually for large $n$. If $H = \log |\mathcal{X}|$ then there is no guarantee that we can collect $J_{k(n)}$
samples from $X^{-n}$ but the estimate $\hat{P}(x | X^{-n})$ will nevertheless be consistent in mean if
the default measure is $Q(x) = |\mathcal{X}|^{-1}$ because the outcomes happen to be independent
identically distributed according to this distribution $Q(x)$ when $H = \log |\mathcal{X}|$. ∎

The estimates $\hat{P}_k(x | X^{-\lambda(k)})$ in (40) are consistent in the pointwise sense under
certain conditions. For example, if $\{X_t\}$ is a stationary finite-state Markov chain with
order $K$ then the empirical estimates $\hat{P}_k(x | X^{-\lambda(k)})$ are averages of bounded random
variables $1\{X_{-\tau(k,j)} = x\}$ ($j = 1, 2, \ldots, J_k$) that are conditionally independent and
identically distributed given $X^{-k}$ when $k \ge K$. It follows that the estimates $\hat{P}_k(x | X^{-\lambda(k)})$
converge exponentially fast in the number of samples to the conditional probability
$P\{x | X^{-K}\} = P\{x | X^-\}$ and therefore the estimates are pointwise consistent. It is not
known whether the estimates $\hat{P}_k(x | X^{-\lambda(k)})$ converge in the pointwise sense for all
finite-alphabet stationary time series.

If we know the entropy rate $H$ in advance we can make use of it. In this case, weak
consistency is guaranteed if $k(n) = (1 - \epsilon)(\log n)/R$ for some $R > H$ and $J_{k(n)} = n^\epsilon$.
Indeed, if $H < r < R$ then $\lambda(k(n)) \le n$ eventually for large $n$ since

$$\lambda(k(n)) = k(n) + \tau^{k(n)}_{J(k(n))} \le k(n) + J(k(n))\, 2^{r k(n)} = o(n). \tag{49}$$

If the entropy rate is not known in advance then we must be prepared to deal with the
worst case of nearly maximum entropy rate. The estimates will be wasteful if the entropy
rate is low because they exploit only a small portion of the available data segment $X^{-n}$
when $H < \log |\mathcal{X}|$. If $k(n) = (1 - \epsilon) \log_{|\mathcal{X}|} n$ and $J_{k(n)} = n^\epsilon$ then the length of the useful
portion is about

$$J_{k(n)}\, 2^{H k(n)} = n^\alpha, \tag{50}$$

where $\alpha = \epsilon + (1 - \epsilon) H / \log |\mathcal{X}|$ varies linearly between $\epsilon \le \alpha \le 1$ as $0 \le H \le \log |\mathcal{X}|$.
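A small worked instance may make the exponent $\alpha = \epsilon + (1 - \epsilon) H / \log |\mathcal{X}|$ concrete. The numbers below are chosen only for illustration: a binary alphabet, entropy rate $H = 1/2$ bit per symbol, $\epsilon = 1/4$ and record length $n = 2^{16}$.

```latex
\[
  \alpha = \tfrac{1}{4} + \tfrac{3}{4} \cdot \frac{1/2}{\log_2 2} = \tfrac{5}{8},
  \qquad
  k(n) = \tfrac{3}{4} \log_2 2^{16} = 12,
  \qquad
  J_{k(n)} = \left(2^{16}\right)^{1/4} = 16,
\]
\[
  J_{k(n)}\, 2^{H k(n)} = 16 \cdot 2^{6} = 2^{10} = \left(2^{16}\right)^{5/8} = n^{\alpha}.
\]
```

In this instance roughly $1024$ of the $65536$ available symbols are actually inspected, which is the sense in which the estimate is wasteful when the entropy rate is well below its maximum.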
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The length X{k) = k + Tj^f,-^ of the data record X^^^'^^ that must be examined to collect 
Jk samples of the pattern X"*^ grows approximately like Jk^.^^, which is polynomial in 
Jk if Jk grows exponentially fast with k. Also, the length n of the segment X^" is just 
polynomial in the sample size Jk{n) if Jk{n) — i^"^- The strongly consistent estimates of 
Morvai, Yakowitz and Gyorfi [21] are much less efficient: they collect J samples from a 
data record whose length grows like a tower of exponentials in (7). Their samples are very 
sparse because extremely stringent demands are placed on the context where those samples 
are taken. For the weakly consistent estimates of the present study, the demands on context 
arc much less severe and so the samples are much more abundant although perhaps less 
trustworthy. Thus universal prediction is not hopelessly out of computational reach as it 
might seem for an algorithm whose input demands grow as a tower of exponentials in (7). 
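
As a rough numerical illustration of this trade-off, the following minimal sketch evaluates a block length of the form $(1-\epsilon)\log_{|\mathcal{X}|} n$, a sample size of the form $n^{\epsilon}$, and the exponent $\epsilon + (1-\epsilon)H/\log|\mathcal{X}|$ of the useful portion. The function names, the use of base-2 logarithms, and the assumption that the sample size grows like $n^{\epsilon}$ are choices made here for illustration only.

```python
import math


def schedule(n, eps, alphabet_size):
    """Return (k, J) for a data segment of length n.

    k is the block length floor((1 - eps) * log_{|X|} n) and J = n**eps is the
    assumed sample size; both are illustrative choices.
    """
    k = math.floor((1.0 - eps) * math.log2(n) / math.log2(alphabet_size))
    return k, n ** eps


def useful_exponent(eps, entropy_rate_bits, alphabet_size):
    """Exponent alpha such that the useful portion has length about n**alpha.

    entropy_rate_bits is the entropy rate H measured in bits per symbol.
    """
    return eps + (1.0 - eps) * entropy_rate_bits / math.log2(alphabet_size)
```

With `eps = 0.5` and a binary alphabet, the exponent ranges from 0.5 for a deterministic process up to 1 for a process of maximum entropy rate, which is the sense in which low-entropy sources leave most of the data segment unused.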

C. Weak Consistency for Real-valued Processes

When $\mathcal{X}$ is the real line or a σ-compact Polish space, the estimate $\hat P_k(dx|X^{-\lambda_k})$ is
defined by the formula in (16). We now choose a nondecreasing unbounded sequence k(n)
and we define $\hat P(dx|X^{-n})$ as the empirical conditional distribution $\hat P_{k(n)}(dx|X^{-\lambda_{k(n)}})$ if this
estimate can be computed from the available data segment $X^{-n}$. Otherwise, if $\lambda_{k(n)} > n$,
we truncate the search and define $\hat P(dx|X^{-n})$ as some default measure Q(dx). Thus

$$\hat P(dx|X^{-n}) = \begin{cases} \hat P_{k(n)}(dx|X^{-\lambda_{k(n)}}) & \text{if } \lambda_{k(n)} \le n, \\ Q(dx) & \text{otherwise.} \end{cases} \qquad (51)$$

If k(n) grows slowly then truncation is rare and $\hat P(dx|X^{-n})$ coincides most of the time with
the estimator $\hat P_{k(n)}(dx|X^{-\lambda_{k(n)}})$ which is weakly consistent. The question is how slowly
the partition index $k(n)$, the block length $\ell(k(n))$ and the sample size $J(k(n))$ must grow
with n to get consistent estimates of $P(dx|X^-)$. It suffices that $P\{\lambda(k(n)) \le n\} \to 1$.
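
The truncation rule can be sketched in code. Formula (16) is not reproduced in this section, so the sketch below assumes the recurrence-time construction described in the text: scan backwards for earlier occurrences of the most recent quantized block, record the outcome that followed each occurrence, and fall back to a default when the data segment runs out. The function names and the `quantize` callback are hypothetical.

```python
def recurrence_samples(past, block_len, num_samples, quantize):
    """Collect outcomes that followed earlier occurrences of the latest quantized block.

    past[-1] plays the role of X_{-1}. Returns None when the segment is too short
    to yield num_samples recurrences (the truncation case).
    """
    if block_len < 1 or len(past) < block_len:
        return None
    symbols = [quantize(x) for x in past]
    pattern = symbols[-block_len:]
    samples = []
    end = len(past) - 1  # candidate sample position, scanning backwards in time
    while end >= block_len and len(samples) < num_samples:
        if symbols[end - block_len:end] == pattern:
            samples.append(past[end])
        end -= 1
    return samples if len(samples) == num_samples else None


def truncated_estimate(past, block_len, num_samples, quantize, default):
    """Return the sampled outcomes (an empirical distribution) or the default."""
    samples = recurrence_samples(past, block_len, num_samples, quantize)
    return default if samples is None else samples
```

The returned list represents the empirical measure that puts equal mass on each sampled outcome; the default value stands in for the measure Q(dx).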

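Theorem 4. Let $\{X_t\}$ be a real-valued stationary ergodic time series and choose $\mathcal{B}_k$,
$\ell_k$ and $J_k$ as before. Let $\Xi_k$ denote the set of atoms of the finite field $\mathcal{B}_k$ and choose a
nondecreasing unbounded sequence of integers k(n) and numbers $\epsilon_k \to 0$ such that

$$n \ge \ell_{k(n)} + J_{k(n)}|\Xi_{k(n)}|^{\ell_{k(n)}}/\epsilon_{k(n)} \quad \text{eventually for large } n. \qquad (52)$$

Then $P\{n \ge \lambda_{k(n)}\} \to 1$ as $n \to \infty$, and the estimates $\hat P(dx|X^{-n})$ are weakly consistent:
for every set B in the generating field $\bigcup_k \mathcal{B}_k$ we have

$$\hat P\{X \in B|X^{-n}\} \to P\{X \in B|X^-\} \quad \text{in } L^1(P), \qquad (53)$$

and for every bounded continuous function h(x) we have

$$\int h(x)\,\hat P(dx|X^{-n}) \to \int h(x)\,P(dx|X^-) \quad \text{in } L^1(P). \qquad (54)$$

Proof: The inter-recurrence times $\tau^k_j - \tau^k_{j-1}$ (j = 1, 2, 3, ...) are identically distributed
conditionally given the pattern $[X^{-\ell_k}]^k$. By Kac's result,

$$E\{\tau^k_{J_k} \,|\, [X^{-\ell_k}]^k\} = J_k E\{\tau^k_1 \,|\, [X^{-\ell_k}]^k\} = \frac{J_k}{P([X^{-\ell_k}]^k)}. \qquad (55)$$

It follows that

$$E\{\tau^k_{J_k}\} = J_k E\{\tau^k_1\} = J_k \sum_{[x^{-\ell_k}]^k} P([x^{-\ell_k}]^k)\, E\{\tau^k_1 \,|\, [x^{-\ell_k}]^k\} \le J_k|\Xi_k|^{\ell_k}. \qquad (56)$$

(The sum is taken over $[x^{-\ell_k}]^k$ such that $P([x^{-\ell_k}]^k) = P\{[X^{-\ell_k}]^k = [x^{-\ell_k}]^k\}$ is strictly
positive.) By the Markov inequality,

$$P\{\lambda_k > \ell_k + J_k|\Xi_k|^{\ell_k}/\epsilon_k\} = P\{\tau^k_{J_k} > J_k|\Xi_k|^{\ell_k}/\epsilon_k\} \le \frac{E\{\tau^k_{J_k}\}\,\epsilon_k}{J_k|\Xi_k|^{\ell_k}} \le \epsilon_k. \qquad (57)$$

Assertions (53) and (54) follow from Theorem 1A and 1B because $P\{\ell_k + J_k|\Xi_k|^{\ell_k}/\epsilon_k \ge \lambda_k\} \to 1$
and hence, in view of assumption (52),

$$P\{n \ge \ell_{k(n)} + J_{k(n)}|\Xi_{k(n)}|^{\ell_{k(n)}}/\epsilon_{k(n)} \ge \lambda_{k(n)}\} \to 1 \quad \text{as } n \to \infty. \qquad (58)$$

This completes the proof of the theorem. ∎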
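
The counting step behind (56) can be checked exactly on a toy example. The sketch below uses a periodic sequence with a uniformly random phase, which is a hypothetical stationary ergodic process chosen here only because everything can be enumerated; it is not an example from the paper. By Kac's formula the recurrence time of a block, averaged over the phase, equals the number of distinct blocks that occur, which is at most the number of possible blocks.

```python
from fractions import Fraction


def mean_recurrence_time(cycle, block_len):
    """Average first recurrence time of the block seen at a uniformly random phase.

    Returns (mean recurrence time, number of distinct blocks) for the periodic
    sequence obtained by repeating `cycle` forever.
    """
    p = len(cycle)

    def block(i):
        # Block of length block_len starting at position i (indices taken mod p).
        return tuple(cycle[(i + m) % p] for m in range(block_len))

    total = 0
    for i in range(p):
        t = 1
        while block(i - t) != block(i):  # look back until the same block recurs
            t += 1
        total += t
    distinct = len({block(i) for i in range(p)})
    return Fraction(total, p), distinct
```

For the cycle `[0, 0, 1]` and blocks of length 1, the recurrence times at the three phases are 2, 1 and 3, so the mean is 2, matching the two distinct symbols.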

The theorem remains valid in the stationary non-ergodic case. Indeed, let P be a
stationary distribution and let $P_\omega$ denote the ergodic mode of $\omega$. Then one may argue as
above that $P_\omega\{\lambda_k \le \ell_k + J_k|\Xi_k|^{\ell_k}/\epsilon_k\} \to 1$. By the ergodic decomposition theorem and
Lebesgue's dominated convergence theorem,

$$\lim_k P\{\lambda_k \le \ell_k + J_k|\Xi_k|^{\ell_k}/\epsilon_k\} = \lim_k \int P_\omega\{\lambda_k \le \ell_k + J_k|\Xi_k|^{\ell_k}/\epsilon_k\}\,P(d\omega)$$
$$= \int \lim_k P_\omega\{\lambda_k \le \ell_k + J_k|\Xi_k|^{\ell_k}/\epsilon_k\}\,P(d\omega) = \int 1\,P(d\omega) = 1. \qquad (59)$$

Thus the conclusions of the theorem also hold for stationary nonergodic processes.

IV. The Information Theoretic Point of View 

In this section we discuss conditional distribution estimates $\hat P(dx|X^{-t})$ that are
consistent in expected information divergence. Such estimates are also weakly consistent, but
the converse is not necessarily true. It is possible to construct estimator sequences that
are consistent in expected information divergence for all stationary processes with values
in a finite alphabet, but not for all stationary processes with values in a countably infinite
alphabet. There are connections with universal gambling or modeling schemes and with
universal noiseless data compression algorithms for finite alphabet processes. For more
information on these subjects see Rissanen and Langdon [28] and Algoet [1].

A. Consistency in Expected Information Divergence 

The Kullback-Leibler information divergence between two probability distributions P
and Q on a measurable space $\mathcal{X}$ is defined as follows: if P is dominated by Q then

$$I(P|Q) = E_P\left\{\log\frac{dP}{dQ}\right\}, \qquad (60)$$

otherwise $I(P|Q) = \infty$. The variational distance is defined as

$$\|P - Q\| = \sup_{-1 \le h(x) \le 1} \left|\int h\,dP - \int h\,dQ\right| \qquad (61)$$

where the supremum is taken over all measurable functions h(x) such that $|h(x)| \le 1$. If
$p = dP/d\mu$ and $q = dQ/d\mu$ are the densities of P and Q relative to a dominating σ-finite
measure $\mu$ then $\|P - Q\| = \int |p - q|\,d\mu$. Exercise 17 on p. 58 of Csiszár and Körner [11]
asserts that

$$\frac{\log e}{2}\,\|P - Q\|^2 \le I(P|Q). \qquad (62)$$

It follows that $I(P|Q) \ge 0$ with equality iff P = Q. Pinsker [26], pp. 13-15 proved the
existence of a universal constant $\Gamma > 0$ such that

$$I(P|Q) \le E_P\left|\log\frac{dP}{dQ}\right| \le I(P|Q) + \Gamma\sqrt{I(P|Q)}. \qquad (63)$$

Barron [6] simplified Pinsker's argument and proved that the constant $\Gamma = \sqrt{2}$ is best
possible when natural logarithms are used in the definition of $I(P|Q)$.
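
For distributions on a finite set, the divergence and the variational distance reduce to finite sums, and the lower bound on the divergence in terms of the squared variational distance can be checked numerically. The following is a minimal sketch with divergence measured in bits; the function names are illustrative and not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """Information divergence I(P|Q) in bits for probability vectors p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # P is not dominated by Q
            total += pi * math.log2(pi / qi)
    return total


def variational_distance(p, q):
    """Variational distance, i.e. the sum of |p(x) - q(x)| over all x."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))
```

For example, with `p = [0.5, 0.5]` and `q = [0.25, 0.75]` the variational distance is 0.5, and `(math.log2(math.e) / 2) * 0.5 ** 2` is smaller than `kl_divergence(p, q)`, as the inequality requires.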

Let $\{X_t\}$ be a stationary process with values in a complete separable metric space
$\mathcal{X}$. The divergence between the true conditional distribution $P(dx|X^-)$ and an estimate
$\hat P(dx|X^{-t})$ is a nonnegative function of the past $X^-$ which vanishes iff $P(dx|X^-) =
\hat P(dx|X^{-t})$ P-almost surely. We say that the estimates $\hat P(dx|X^{-t})$ are consistent in
information divergence for a class $\Pi$ of stationary distributions on $\mathcal{X}^{\mathbb{Z}}$ if for any $P \in \Pi$,

$$I(P_{X|X^-}\,|\,\hat P_{X|X^{-t}}) \to 0 \quad P\text{-almost surely.} \qquad (64)$$

We say that $\hat P(dx|X^{-t})$ is consistent in expected information divergence for the class $\Pi$ if
for any $P \in \Pi$,

$$E_P\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-t}})\} \to 0. \qquad (65)$$

Such estimates are weakly consistent for all distributions in the class $\Pi$. Indeed, if h(x) is
any bounded measurable function on $\mathcal{X}$ with norm $\|h\|_\infty = \sup_x |h(x)|$ then

$$\left|\int h(x)\,P(dx|X^-) - \int h(x)\,\hat P(dx|X^{-t})\right| \le \|h\|_\infty\,\|P_{X|X^-} - \hat P_{X|X^{-t}}\|. \qquad (66)$$

Applying the Csiszár-Kemperman-Kullback inequality (62), we see that

$$\left|\int h(x)\,P(dx|X^-) - \int h(x)\,\hat P(dx|X^{-t})\right|^2 \le \frac{2\|h\|_\infty^2}{\log e}\, I(P_{X|X^-}\,|\,\hat P_{X|X^{-t}}). \qquad (67)$$

If $\hat P(dx|X^{-t})$ is consistent in expected information divergence for $\Pi$ then $\int h(x)\,\hat P(dx|X^{-t})$
converges in $L^2(P)$ and also in $L^1(P)$ to $\int h(x)\,P(dx|X^-)$ whenever $P \in \Pi$.

Suppose the outcomes $X_t$ are independent with identical distribution $P_X$ on $\mathcal{X}$. Barron,
Györfi and van der Meulen [7] have constructed estimates $\hat P(dx|X^{-t})$ that are consistent in
information divergence and in expected information divergence when the true distribution
$P_X$ has finite information divergence $I(P_X|M_X) < \infty$ relative to some known normalized
reference measure $M_X$. Györfi, Páli and van der Meulen [16] assume that $\mathcal{X}$ is the countable
set of integers and argue that for arbitrary conditional probability mass function estimates
$\hat P(x|X^{-n})$, there exists some distribution $P_X$ with finite entropy such that

$$I(P_X\,|\,\hat P_{X|X^{-n}}) = \infty \quad \text{almost surely for all } n. \qquad (68)$$

Therefore, it is impossible to construct estimates $\hat P(dx|X^{-t})$ that are consistent in
information divergence or in expected information divergence for all independent identically
distributed processes with values in an infinite space. For stationary processes with values
in a finite alphabet, the constructions of Ornstein [22] and Morvai, Yakowitz and Györfi [21]
yield estimates $\hat P(x|X^{-t})$ such that $\log \hat P(x|X^{-t})$ converges almost surely to $\log P(x|X^-)$.
It is still an open question whether these estimates are consistent in information
divergence or whether modifications are needed to get such consistency. (The difficulty is
that small changes in $\hat P(x|X^{-t})$ cause huge changes in $\log \hat P(x|X^{-t})$ when $\hat P(x|X^{-t})$ is
small.) However, it is easy to construct estimates $\hat P(x|X^{-t})$ that are consistent in expected
information divergence.

B. Consistent Estimates for Finite-alphabet Processes 

Let $\{X_t\}$ be a stationary process with values in a finite set $\mathcal{X}$. We shall construct
conditional probability mass function estimates $\hat P(x|X^{-n})$ that are consistent in expected
information divergence for any stationary $P \in \mathcal{P}_s$. Such estimates also converge to $P(x|X^-)$
in mean: for any stationary $P \in \mathcal{P}_s$ and $x \in \mathcal{X}$ we have

$$\hat P(x|X^{-n}) \to P(x|X^-) \quad \text{in } L^1(P). \qquad (69)$$

An observation of Perez [25] implies that consistency in expected information divergence
is equivalent to mean consistency of $\log \hat P(X|X^{-n})$.

Theorem 5. Let $\{X_t\}$ be a stationary process with values in a finite alphabet $\mathcal{X}$. A
sequence of conditional probability mass function estimates $\hat P(x|X^{-n})$ is consistent in
expected information divergence iff we have mean convergence

$$\log \hat P(X|X^{-n}) \to \log P(X|X^-) \quad \text{in } L^1. \qquad (70)$$

Proof: Pinsker's inequality (63) for $P(x|X^-)$ and $\hat P(x|X^{-n})$ asserts that

$$I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}}) \le E\left\{\left|\log\frac{P(X|X^-)}{\hat P(X|X^{-n})}\right| \,\Big|\, X^-\right\} \le I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}}) + \Gamma\sqrt{I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}})}. \qquad (71)$$

Taking expectations and using concavity of the square root function, we obtain

$$E\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}})\} \le E\left|\log\frac{P(X|X^-)}{\hat P(X|X^{-n})}\right| \le E\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}})\} + \Gamma\sqrt{E\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}})\}} \qquad (72)$$

by Jensen's inequality. This suffices to prove the theorem. ∎

To construct the estimates $\hat P(x|X^{-n})$, we start with probability mass functions $Q(x^n)$
on the product spaces $\mathcal{X}^n$ such that for every stationary distribution P on $\mathcal{X}^{\mathbb{Z}}$,

$$n^{-1} I(P_{X^n}|Q_{X^n}) \to 0 \quad \text{as } n \to \infty. \qquad (73)$$

Several methods are known for constructing such models $Q(x^n)$; see Section C below. By
Pinsker's inequality, convergence of the means in (73) is equivalent to mean convergence

$$\frac{1}{n}\log\frac{P(X^n)}{Q(X^n)} \to 0 \quad \text{in } L^1(P). \qquad (74)$$

Let now $Q(x|x^{-t})$ denote a shifted copy of the conditional probability mass function
$Q(x_t|x^t)$ that appears in the chain rule expansion $Q(x^n) = \prod_{0 \le t < n} Q(x_t|x^t)$. The estimate
$\hat P(x|X^{-n})$ is defined in terms of $Q(x^n)$ as

$$\hat P(x|X^{-n}) = \frac{1}{n}\sum_{0 \le t < n} Q(x|X^{-t}). \qquad (75)$$

Theorem 6. Let $\mathcal{X}$ be a finite alphabet and let $\{Q(x^n)\}_{n \ge 1}$ be a model sequence such
that (73) or (74) holds for all $P \in \mathcal{P}_s$. Then the conditional probability mass function
estimates $\hat P(x|X^{-n})$ are consistent in expected information divergence for the class $\mathcal{P}_s$ of
all stationary process distributions on $\mathcal{X}^{\mathbb{Z}}$.

Proof: The Kullback-Leibler divergence functional is convex in both arguments. By the
definition (75) of $\hat P(x|X^{-n})$ and by Jensen's inequality,

$$I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}}) \le \frac{1}{n}\sum_{0 \le t < n} I(P_{X|X^-}\,|\,Q_{X|X^{-t}}). \qquad (76)$$

Now we take expectations with respect to some distribution $P \in \mathcal{P}_s$. By stationarity and
the chain rule expansion of information divergence, we obtain

$$E_P\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}})\} \le \frac{1}{n}\sum_{0 \le t < n} E_P\{I(P_{X|X^-}\,|\,Q_{X|X^{-t}})\}$$
$$= \frac{1}{n}\sum_{0 \le t < n} E_P\{I(P_{X_t|X^-X^t}\,|\,Q_{X_t|X^t})\}$$
$$= \frac{1}{n} E_P\{I(P_{X^n|X^-}\,|\,Q_{X^n})\}$$
$$= \frac{1}{n} E_P\{I(P_{X^n|X^-}\,|\,P_{X^n})\} + \frac{1}{n} I(P_{X^n}|Q_{X^n}). \qquad (77)$$

Observe that

$$E_P\{I(P_{X^n|X^-}\,|\,P_{X^n})\} = H(X^n) - H(X^n|X^-) \qquad (78)$$

where $H(X^n) = E_P\{-\log P(X^n)\}$ and $H(X^n|X^-) = E_P\{-\log P(X^n|X^-)\}$. The entropy
rate of the process is defined as $H = H(X|X^-) = n^{-1}H(X^n|X^-) = \lim_n n^{-1}H(X^n)$, so
one may conclude that

$$\frac{1}{n} E_P\{I(P_{X^n|X^-}\,|\,P_{X^n})\} = \frac{1}{n}H(X^n) - H \to 0 \quad \text{as } n \to \infty. \qquad (79)$$

It follows from (77) and (79) that the estimates $\hat P(x|X^{-n})$ are consistent in expected
information divergence, as claimed. ∎
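
The averaging construction of the estimate from a sequential model can be sketched as follows. The add-one (Laplace) rule used here as the conditional model is only a simple stand-in chosen for illustration; it is not claimed to be one of the universal models required by the theorem, and the function names are hypothetical.

```python
def laplace_model(context, alphabet):
    """Stand-in conditional model Q(x | context): add-one frequency estimate."""
    counts = {a: 1 for a in alphabet}
    for symbol in context:
        counts[symbol] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def averaged_estimate(past, alphabet, model):
    """Average the model's predictions over the contexts X^{-t}, 0 <= t < n.

    past[-1] plays the role of X_{-1}; the context X^{-t} consists of the t most
    recent outcomes, so t = 0 gives the empty context. Requires len(past) >= 1.
    """
    n = len(past)
    estimate = {a: 0.0 for a in alphabet}
    for t in range(n):
        q = model(past[n - t:], alphabet)
        for a in alphabet:
            estimate[a] += q[a] / n
    return estimate
```

For `past = [0, 0]` and alphabet `(0, 1)` the two contexts are the empty context and `[0]`, giving probabilities 1/2 and 2/3 for the symbol 0 and hence an averaged estimate of 7/12.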

The procedure which constructs $\hat P(x|x^{-n})$ from the models $Q(x^n)$ can be reversed.
Indeed, let $\{\hat P(x|x^{-t})\}_{t \ge 0}$ be a sequence such that for every stationary distribution $P \in \mathcal{P}_s$,
the expected information divergence of $P(x|X^-)$ relative to $\hat P(x|X^{-t})$ is finite for all t and
vanishes in the limit as $t \to \infty$. Let $\hat P(x_t|x^t)$ be constructed from the t-past at time t
in the same way as $\hat P(x|x^{-t})$ was constructed from the t-past at time 0. The Kullback-Leibler
information divergence of the true marginal distribution $P(x^n)$ with respect to the
compounded model $\hat P(x^n) = \prod_{0 \le t < n} \hat P(x_t|x^t)$ admits the chain rule expansion

$$I(P_{X^n}|\hat P_{X^n}) = \sum_{0 \le t < n} E_P\{I(P_{X_t|X^t}\,|\,\hat P_{X_t|X^t})\}. \qquad (80)$$

By stationarity

$$E_P\{I(P_{X_t|X^t}\,|\,\hat P_{X_t|X^t})\} = E_P\{I(P_{X|X^{-t}}\,|\,\hat P_{X|X^{-t}})\}. \qquad (81)$$

The expected divergence of $P(x|X^{-t})$ relative to $\hat P(x|X^{-t})$ is bounded by the expected divergence
of $P(x|X^-)$ relative to $\hat P(x|X^{-t})$ since we have the decomposition

$$E_P\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-t}})\} = E_P\{I(P_{X|X^-}\,|\,P_{X|X^{-t}})\} + E_P\{I(P_{X|X^{-t}}\,|\,\hat P_{X|X^{-t}})\}. \qquad (82)$$

From (80), (81) and (82) one may conclude that

$$I(P_{X^n}|\hat P_{X^n}) \le \sum_{0 \le t < n} E_P\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-t}})\}. \qquad (83)$$

If the expected divergence between $P(x|X^-)$ and $\hat P(x|X^{-t})$ is finite and vanishes in the limit
as $t \to \infty$ then the models $\hat P(x^n) = \prod_{0 \le t < n} \hat P(x_t|x^t)$ have vanishing expected per-symbol
divergence: for all $P \in \mathcal{P}_s$ we have

$$n^{-1} I(P_{X^n}|\hat P_{X^n}) \to 0. \qquad (84)$$

The results of Shields [32] imply that there can be no universal bound on the speed
of convergence in expected information divergence. Indeed, if the expected divergence of
$P(x|X^-)$ relative to $\hat P(x|X^{-t})$ were always $O(\beta_t)$ where $\beta_t \to 0$, then we could construct
a modeling scheme $\{\hat P(x_t|x^t)\}_{t \ge 0}$ such that the divergence of $P(x^n)$ relative to $\hat P(x^n) =
\prod_{0 \le t < n} \hat P(x_t|x^t)$ would be $O(\beta_0 + \ldots + \beta_{n-1})$. The per-symbol divergence of $P(x^n)$ relative
to $\hat P(x^n)$ would vanish with universal rate $O[n^{-1}(\beta_0 + \ldots + \beta_{n-1})]$, which is impossible.

To obtain bounds on the per-symbol divergence one must restrict the process distribution
to some manageable class. In particular, suppose $\Pi$ is a class of Markov processes that
is smoothly parametrized by k free parameters and consider models $\hat P(x^n)$ for which the
per-symbol divergence attains Rissanen's [27] lower bound:

$$\frac{1}{n} I(P_{X^n}|\hat P_{X^n}) = \frac{k\log n}{2n}(1 + o(1)). \qquad (85)$$

If we set $Q(x^n) = \hat P(x^n)$ and define $\hat P(x|X^{-n})$ as in (75), then (77) reduces to the bound

$$E_P\{I(P_{X|X^-}\,|\,\hat P_{X|X^{-n}})\} \le \frac{k\log n}{2n}(1 + o(1)), \quad P \in \Pi. \qquad (86)$$

It is often possible to construct a prequential modeling scheme $\{\hat P(x_t|x^t)\}_{t \ge 0}$ such that
the expected divergence of $P(x_t|X^t)$ relative to $\hat P(x_t|X^t)$ vanishes like $(k\log e)/(2t)$ for all
process distributions in the class $\Pi$. An incremental bound of order $(k\log e)/(2t)$ yields a
normalized cumulative bound of order $n^{-1}\sum_{0 < t < n}(k\log e)/(2t) \approx (k\log n)/(2n)$. By shifting
$\hat P(x_n|X^n)$ we obtain estimates $\hat P(x|X^{-n})$ such that the expected divergence of $P(x|X^-)$
relative to $\hat P(x|X^{-n})$ vanishes like $(k\log e)/(2n)$. This bound of order $(k\log e)/(2n)$ for
$\hat P(x|X^{-n})$ is clearly better than the bound $(k\log n)/(2n)$.
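
The comparison between the incremental and the normalized cumulative bound is a statement about harmonic sums, which the following minimal sketch evaluates numerically. Logarithms are taken to base 2 and the function names are illustrative.

```python
import math


def incremental_bound(k, t):
    """Incremental divergence bound (k log e) / (2 t), in bits."""
    return k * math.log2(math.e) / (2 * t)


def normalized_cumulative_bound(k, n):
    """Average of the incremental bounds over 0 < t < n, i.e. their sum divided by n."""
    return sum(incremental_bound(k, t) for t in range(1, n)) / n
```

For k = 2 and n = 1000 the normalized cumulative bound is within about ten percent of `k * math.log2(n) / (2 * n)`, while the incremental bound at t = n is smaller by a factor of roughly `math.log(n)`.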

C. Modeling and Data Compression 

Any universal data compression scheme for stationary processes with finite alphabet $\mathcal{X}$
can be used as a basis for the construction of models $Q(x^n)$ satisfying (73) or (74). Indeed,
let $l(x^n)$ denote the length of a uniquely decipherable block-to-variable-length binary code
for sequences $x^n \in \mathcal{X}^n$. The redundancy of the code for $X^n$ is defined as the difference
between the actual codeword length $l(X^n)$ and the ideal description length $-\log P(X^n)$:

$$r(X^n) = l(X^n) + \log P(X^n). \qquad (87)$$

The expected redundancy $E_P\{r(X^n)\}$ is equal to the information divergence between the
true probability mass function $P(x^n)$ and the model

$$Q'(x^n) = 2^{-l(x^n)}, \quad x^n \in \mathcal{X}^n. \qquad (88)$$

For a universal noiseless coding scheme, the expected per-symbol redundancy will vanish:

$$\frac{1}{n}E_P\{r(X^n)\} = \frac{1}{n}I(P_{X^n}|Q'_{X^n}) \to 0 \quad \text{for all } P \in \mathcal{P}_s. \qquad (89)$$

The model $Q'(x^n)$ is not necessarily normalized but is always a subprobability measure,
by the Kraft-McMillan inequality. However, if (89) holds for a sequence of subnormalized
models $Q'(x^n)$ then (89) will certainly hold for the normalized models

$$Q(x^n) = \frac{Q'(x^n)}{\sum_{\xi^n \in \mathcal{X}^n} Q'(\xi^n)}. \qquad (90)$$

Theorem 4 of Algoet [1] implies that for any stationary ergodic distribution P, the
per-symbol description length of uniquely decipherable codes is asymptotically bounded below
almost surely by the entropy rate $H(P) = \lim_n n^{-1}E_P\{-\log P(X^n)\}$:

$$\liminf_n n^{-1} l(X^n) \ge H(P) \quad P\text{-almost surely.} \qquad (91)$$

It is well known that there exist universal noiseless codes for which the per-symbol
description length almost surely approaches the entropy rate of the ergodic mode $P_\omega$, with
probability one under any stationary distribution P:

$$n^{-1} l(X^n(\omega)) \to H(P_\omega) \quad P\text{-almost surely, for all } P \in \mathcal{P}_s. \qquad (92)$$

This is true in particular for the data compression algorithm of Ziv and Lempel [36], by
Theorem 12.10.2 of Cover and Thomas [10] or by the results of Ornstein and Weiss [24].
Other examples of noiseless codes satisfying (92) for every stationary ergodic P have been
proposed by Ryabco [29], Ornstein and Shields [23], and Algoet [1]. Choosing the best
among the given code with length $l(x^n)$ and a fixed-length code with length $\lceil n\log|\mathcal{X}|\rceil$
and adding one bit of preamble to indicate which code is better, one obtains a uniquely
decipherable code with length

$$l'(x^n) = 1 + \min\{l(x^n), \lceil n\log|\mathcal{X}|\rceil\}. \qquad (93)$$

The codeword may expand by one bit, but the per-symbol description length is now
bounded by $\log|\mathcal{X}| + 2n^{-1}$ and (92) holds universally not only in the pointwise sense
but also in mean. The corresponding models $Q(x^n)$ are universal in the sense that for any
$P \in \mathcal{P}_s$,

$$\frac{1}{n}\log\frac{P(X^n)}{Q(X^n)} \to 0 \quad P\text{-almost surely and in } L^1(P). \qquad (94)$$


Ryabco [29] and Algoet [1] have constructed probability measures Q with marginals
$Q(x^n)$ such that the pointwise convergence in (94) holds for every stationary $P \in \mathcal{P}_s$.
Each marginal $Q(x^n)$ is equal to the compounded product $Q(x^n) = \prod_{0 \le t < n} Q(x_t|x^t)$, and
Ryabco's scheme has the extra property that when P is finite order Markov,

$$\log\left(\frac{P(X_t|X^t)}{Q(X_t|X^t)}\right) \to 0 \quad P\text{-almost surely.} \qquad (95)$$

Rissanen and Langdon [28] and Langdon [18] previously observed that the Lempel-Ziv
algorithm defines a sequential predictive modeling scheme $Q = \{Q(x_t|x^t)\}$. The per-symbol
divergence vanishes pointwise in the Cesàro mean sense, for every $P \in \mathcal{P}_s$.

However, the pointwise convergence in (95) must fail for some $P \in \mathcal{P}_s$ because the quality
of the predictive model $Q(x_t|X^t)$ degrades whenever the Lempel-Ziv incremental parsing
procedure comes to the end of a phrase. The leaves of the dictionary tree and the nodes
with few descendants are exactly those where empirical evidence is still lacking to make a
reliable forecast. The number of times a node has been visited is equal to the number of
leaves in the subtree rooted at that node, and if this number is small then the predictive
model for the next symbol is a poor estimate based on few samples.
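
The incremental parsing procedure mentioned here can be sketched in a few lines. The code below implements the standard Lempel-Ziv incremental (LZ78-style) parsing of a string into phrases and counts how many phrases pass through a given node of the dictionary tree, used here as a simple proxy for the amount of evidence available at that node. It does not reproduce the predictive probabilities of the schemes cited above, and the function names are illustrative.

```python
def lz78_parse(sequence):
    """Parse a string into phrases, each the shortest prefix not seen as a phrase before.

    Returns (phrases, tail), where tail is the unfinished last phrase (possibly empty).
    """
    phrases, seen, current = [], set(), ""
    for symbol in sequence:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases, current


def phrases_through(phrases, node):
    """Number of parsed phrases whose path in the dictionary tree passes through node."""
    return sum(1 for phrase in phrases if phrase.startswith(node))
```

Parsing `"1011010100010"` yields the phrases 1, 0, 11, 01, 010, 00, 10. The node `"0"` has been traversed by four phrases, whereas the node `"010"` has been reached only once, so a forecast made at that node rests on very little data.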

If the estimates $\hat P(x|X^{-t})$ are universally consistent in expected information divergence
then $\log[P(X|X^-)/\hat P(X|X^{-t})] \to 0$ in $L^1(P)$ for all stationary $P \in \mathcal{P}_s$, by Theorem 5. Thus
the shifted estimates $\hat P(x_t|X^t)$ are universally consistent in the sense that for all $P \in \mathcal{P}_s$,

$$\log\left(\frac{P(X_t|X^-X^t)}{\hat P(X_t|X^t)}\right) \to 0 \quad \text{in } L^1(P). \qquad (97)$$

Bailey [5] and Ryabco [30] proved that no modeling scheme Q exists such that the pointwise
convergence in (95) holds for every stationary ergodic distribution P. The argument of [30]
shows that for any modeling scheme Q there exists a stationary ergodic distribution P on
$\mathcal{X}^\infty$ where $\mathcal{X} = \{a, b, c\}$ such that P fails to satisfy both (95) and the statement

$$P(X_t|X^t) - Q(X_t|X^t) \to 0 \quad P\text{-almost surely.} \qquad (98)$$

The offending P is determined by a Markov chain with a countable set of states {0, 1, 2, ...}.
Given that the Markov chain is in state i, it moves to state 0 with probability 1/2 and
generates the letter a, or it moves to state i + 1 with probability 1/2 and generates the
letter b or c with conditional probability $\lambda_i$ and $(1 - \lambda_i)$, where $\lambda_i$ is a parameter equal
to either 1/3 or 2/3. The distribution of the Markov chain is determined by the infinite
sequence $\lambda = (\lambda_0, \lambda_1, \ldots)$. If the Markov chain is started in its stationary distribution
then the resulting distribution $P_\lambda$ on the sequence space $\mathcal{X}^\infty$ is stationary ergodic. Exact
prediction is impossible when the Markov chain visits a state i which it has not visited
before, because the predictor doesn't know whether the probability $\lambda_i/2$ of next seeing
symbol b is equal to 1/3 or 1/6. The Markov chain will visit states with arbitrarily large
labels i, and the predictor must make inaccurate predictions infinitely often with positive
probability under distribution $P_\lambda$ for some $\lambda = (\lambda_0, \lambda_1, \ldots)$.

V. Application to Online Prediction 

In this section we discuss some applications of the estimates $\hat P(dx|X^{-t})$ to on-line
prediction, regression and classification. We deal with special cases of a sequential decision
problem that can be formulated abstractly as follows.

Let $\{X_t\}$ be a stationary process with values in the space $\mathcal{X}$ and let $l(x,a)$ be a loss
function on $\mathcal{X} \times \mathcal{A}$ where $\mathcal{A}$ is a space of possible actions. We assume that $\mathcal{X}$ is a
complete and $\mathcal{A}$ is a compact separable metric space and the loss function $l(x,a)$ is bounded
and continuous on $\mathcal{X} \times \mathcal{A}$. We wish to select nonanticipating actions $A_t = A_t(X^t)$ with
knowledge of the past $X^t = (X_0, \ldots, X_{t-1})$ so as to minimize the long run average loss per
decision:

$$\limsup_n \frac{1}{n}\sum_{0 \le t < n} l(X_t, A_t) = \text{Min!} \qquad (99)$$

If the process distribution is known a priori then the optimum strategy is to select
actions $A^*_t = \arg\min_{a \in \mathcal{A}} E\{l(X_t,a)|X^t\}$ that attain the minimum conditional expected
loss given the available information $X^t$ at each time t. Suppose P is stationary and let
$L(X_t|X^t)$ denote the expectation of the minimum conditional expected loss given the t-past
at time t:

$$L(X_t|X^t) = E\{l(X_t, A^*_t)\} = \inf_{A_t = A_t(X^t)} E\{l(X_t, A_t)\}. \qquad (100)$$

Similarly let $L(X|X^{-t})$ and $L(X|X^-)$ denote the minimum expected loss given the
t-past and the minimum expected loss given the infinite past at time 0. By stationarity
$L(X_t|X^t) = L(X|X^{-t})$, and $L(X|X^{-t})$ is clearly monotonically decreasing to a limit which
by continuity must be $L(X|X^-)$. Thus for any stationary distribution P one may define

$$L^*(P) = \downarrow\!\lim_t L(X_t|X^t) = \downarrow\!\lim_t L(X|X^{-t}) = L(X|X^-). \qquad (101)$$

If P is stationary ergodic then the minimum long run average loss is well defined and almost
surely equal to $L^*(P) = L(X|X^-)$ by Theorem 6 of Algoet [2]:

$$\frac{1}{n}\sum_{0 \le t < n} l(X_t, A^*_t) \to L^*(P) \quad P\text{-almost surely and in } L^1(P). \qquad (102)$$

Now suppose the process distribution is unknown a priori. It is shown in Section V.B of
Algoet [2] that there exist nonanticipating actions $\hat A_t = \hat A_t(X^t)$ which attain the minimum
long run average loss $L^*(P)$ with probability one under any stationary ergodic process
distribution P on $\mathcal{X}^{\mathbb{Z}}$. The actions are constructed by a plug-in approach as follows.
Choose estimates $\hat P(dx|X^{-t})$ that converge in law to the conditional distribution $P(dx|X^-)$
with probability one under any stationary P and construct $\hat P(dx_t|X^t)$ from $X^t$ in the same
way as $\hat P(dx|X^{-t})$ was computed from $X^{-t}$. Then $\hat A_t$ is defined as an action that attains
the minimum conditional expected loss given $X^t$ under $\hat P(dx_t|X^t)$:

$$\hat A_t = \arg\min_{a \in \mathcal{A}} \int l(x_t, a)\,\hat P(dx_t|X^t). \qquad (103)$$

The average loss incurred by the actions $\hat A_t$ converges pointwise to the minimum long run
average loss $L^*(P)$.
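
For a finite outcome space and a finite set of candidate actions, the plug-in rule amounts to minimizing an expected loss computed under the estimated distribution. The following is a minimal sketch under those finiteness assumptions; the function name and the dictionary representation of the estimate are illustrative choices.

```python
def plug_in_action(estimate, actions, loss):
    """Return the action with the smallest expected loss under the estimate.

    estimate maps each outcome x to its estimated conditional probability,
    and loss(x, a) is the loss incurred by action a when the outcome is x.
    """
    def expected_loss(action):
        return sum(prob * loss(x, action) for x, prob in estimate.items())

    return min(actions, key=expected_loss)
```

With the estimate `{0: 0.3, 1: 0.7}`, the Hamming loss selects the more likely outcome 1, while the squared loss over the actions `[0.0, 0.5, 1.0]` selects 0.5, the candidate closest to the estimated mean 0.7.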

In this paper we rely on conditional distribution estimates $\hat P(dx|X^{-t})$ that are weakly
consistent but hopefully more efficient than the pointwise consistent estimates of [22], [1],
[21]. We limit our attention to certain on-line prediction problems, when $\mathcal{X} = \mathcal{A}$ is a
compact separable metric space and the loss $l(x,\hat x)$ is a continuous increasing function of
the distance between the outcome $x$ and the prediction $\hat x$. In classification problems $\mathcal{X} = \mathcal{A}$
is a finite set, $l(x,\hat x) = 1\{x \ne \hat x\}$ is the Hamming distance, and we wish to predict each
outcome $X_t$ with knowledge of the past $X^t$ so as to minimize the long run average rate of
incorrect guesses. In regression problems $\mathcal{X}$ is a finite closed interval, $l(x,\hat x)$ is the squared
euclidean distance, and the goal is to predict $X_t$ from the past $X^t$ so that the long run
average of the squared prediction error is smallest possible. We show that if the estimates
$\hat P(dx|X^{-t})$ are weakly consistent, then the minimum long run average loss in regression
and classification is universally attained in the sense of mean convergence in $L^1(P)$. The
proof is based on the following generalization of von Neumann's mean ergodic theorem,
which parallels Breiman's [8] generalization of Birkhoff's pointwise ergodic theorem. See
also Perez [25].

Lemma. Suppose $(\Omega, \mathcal{F}, P, T)$ is a stationary ergodic system. If $g$ and $\{g_t\}_{t \ge 0}$ are
integrable random variables such that $g_t \to g$ in $L^1(P)$, then

$$\frac{1}{n}\sum_{0 \le t < n} g_t \circ T^t \to E\{g\} \quad \text{in } L^1(P). \qquad (104)$$

Proof: The mean ergodic theorem asserts that

$$\frac{1}{n}\sum_{0 \le t < n} g \circ T^t \to E\{g\} \quad \text{in } L^1(P), \qquad (105)$$

and it is clear that

$$\frac{1}{n}\sum_{0 \le t < n} [g_t \circ T^t - g \circ T^t] \to 0 \quad \text{in } L^1(P) \qquad (106)$$

since the triangle inequality, stationarity and the assumption $E|g_t - g| \to 0$ imply that

$$E\left|\frac{1}{n}\sum_{0 \le t < n} [g_t \circ T^t - g \circ T^t]\right| \le \frac{1}{n}\sum_{0 \le t < n} E|g_t \circ T^t - g \circ T^t| = \frac{1}{n}\sum_{0 \le t < n} E|g_t - g| \to 0. \qquad (107)$$

Addition of (105) and (106) yields (104). ∎



A. Regression

Let $\{X_t\}$ be a stationary ergodic real-valued time series with finite variance. We wish to
predict each outcome $X_t$ with knowledge of the past $X^t$ so that the squared prediction error
$|X_t - \hat X_t|^2$ is smallest possible in the long run average sense. The minimum long run average
is equal to the minimum mean squared error given the infinite past, that is the variance
of the innovation $X - E\{X|X^-\}$. If the outcomes $X_t$ are independent and identically
distributed then the sample mean $\bar X_t = (X_0 + \ldots + X_{t-1})/t$ is an optimal estimator in the
long run. It is challenging to construct on-line predictors $\hat X_t$ that asymptotically attain the
minimum squared prediction error in a universal sense for all stationary ergodic real-valued
processes with finite variance. Here, we consider the simple case of stationary processes
with values in a finite interval $\mathcal{X} = [-K, K]$. We do not assume that K is known a priori.

Let $\{\hat P(dx|X^{-t})\}_{t \ge 0}$ denote a weakly consistent sequence of conditional distribution
estimates as in Section III. Since $h(x) = x$ is a bounded continuous function on $\mathcal{X}$, it
follows from Theorem 4 that $\hat X_{-t} \to \hat X$ in probability where

$$\hat X_{-t} = \int x\,\hat P(dx|X^{-t}), \qquad \hat X = E\{X|X^-\} = \int x\,P(dx|X^-). \qquad (108)$$

Note that $\hat X_{-t}$ is not an estimate of $X_{-t}$ but an estimate of $X = X_0$ based on the t-past $X^{-t}$.
At time t we consider the conditional distribution estimate $\hat P(dx_t|X^t)$ and the predictor

$$\hat X_t = \int x_t\,\hat P(dx_t|X^t). \qquad (109)$$

By construction $\hat X_t$ is the sample mean of some subset of the past outcomes $X_0, \ldots, X_{t-1}$,
except in rare cases when $\hat X_t$ is equal to the default value $\int x\,Q(dx)$. The obvious choice
for Q(dx) is the Dirac measure that places unit mass at x = 0, so that $\int x\,Q(dx) = 0$. For
any stationary ergodic process distribution P on $\mathcal{X}^{\mathbb{Z}}$ we have

$$|X - \hat X_{-t}|^2 \to |X - \hat X|^2 \quad \text{in } L^1(P), \qquad (110)$$

and consequently, by the Lemma,

$$\frac{1}{n}\sum_{0 \le t < n} |X_t - \hat X_t|^2 \to E|X - \hat X|^2 \quad \text{in } L^1(P). \qquad (111)$$
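
The long run average of the squared prediction error can be computed for any on-line predictor that sees only the past. The following minimal sketch evaluates it for the sample mean predictor mentioned above for independent outcomes, with the default prediction 0 when no past is available; the function names are illustrative, and the sample mean is used here only as the simplest example of a nonanticipating predictor.

```python
def sample_mean_predictor(past, default=0.0):
    """Predict the next outcome by the mean of the past, or a default if there is none."""
    return sum(past) / len(past) if past else default


def average_squared_error(outcomes, predictor):
    """Average of |X_t - predictor(X_0, ..., X_{t-1})|**2 over 0 <= t < n."""
    total = 0.0
    for t, x in enumerate(outcomes):
        total += (x - predictor(outcomes[:t])) ** 2
    return total / len(outcomes)
```

For the constant sequence `[1.0, 1.0, 1.0, 1.0]` only the first prediction (the default 0) is wrong, so the average squared error is 0.25 and decays like 1/n as more outcomes arrive.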

B. On-line Prediction and Classification 

Let $\{X_t\}$ be a random process with values in a finite set $\mathcal{X}$. We wish to predict the
outcomes $X_t$ with knowledge of the past $X^t$ so as to minimize the long run average rate of
incorrect guesses. The best predictor for $X = X_0$ given the infinite past $X^-$ is given by

$$\hat X = \arg\max_{x \in \mathcal{X}} P\{X = x|X^-\}. \qquad (112)$$

If the process distribution P is stationary ergodic then the minimum long run average rate
of prediction errors is equal to the error probability

$$P\{X \ne \hat X\} = 1 - E\{P\{X = \hat X|X^-\}\} = 1 - E\left\{\max_{x \in \mathcal{X}} P\{X = x|X^-\}\right\}. \qquad (113)$$

If the process distribution is unknown, we choose some conditional probability mass
estimates $\hat P\{X = x|X^{-t}\}$ that converge in mean to $P\{X = x|X^-\}$ for every stationary process
distribution $P \in \mathcal{P}_s$ and $x \in \mathcal{X}$:

$$\hat P\{X = x|X^{-t}\} \to P\{X = x|X^-\} \quad \text{in } L^1(P). \qquad (114)$$

We construct $\hat P\{X_t = x|X^t\}$ from the past $X^t$ in the same way as $\hat P\{X = x|X^{-t}\}$ was
computed from $X^{-t}$ and we define the predictor

$$\hat X_t = \arg\max_{x \in \mathcal{X}} \hat P\{X_t = x|X^t\}. \qquad (115)$$

Theorem 7. Let $\{X_t\}$ be a stationary ergodic process with values in a finite set $\mathcal{X}$. If the
conditional probability estimates $\hat P\{X = x|X^{-t}\}$ are weakly consistent, then the predictor
$\hat X_t$ achieves the minimum long run average rate of incorrect guesses in probability. Thus
for any stationary ergodic distribution P on $\mathcal{X}^{\mathbb{Z}}$ we have mean convergence

$$\frac{1}{n}\sum_{0 \le t < n} 1\{X_t \ne \hat X_t\} \to P\{X \ne \hat X\} \quad \text{in } L^1(P). \qquad (116)$$

Proof: Observe that $\hat X_t(\omega) = \hat X_{-t}(T^t\omega)$ where T is the left shift on $\mathcal{X}^{\mathbb{Z}}$ and where

$$\hat X_{-t} = \arg\max_{x \in \mathcal{X}} \hat P\{X = x|X^{-t}\}. \qquad (117)$$

For any stationary ergodic P we have, by weak consistency of $\hat P\{X = x|X^{-t}\}$ and continuity
of the maximum function,

$$\max_{x \in \mathcal{X}} \hat P\{X = x|X^{-t}\} \to \max_{x \in \mathcal{X}} P\{X = x|X^-\} \quad \text{in probability} \qquad (118)$$

or equivalently

$$\hat P\{X = \hat X_{-t}|X^{-t}\} \to P\{X = \hat X|X^-\} \quad \text{in } L^1(P). \qquad (119)$$

Since $[P\{X = x|X^-\} - \hat P\{X = x|X^{-t}\}] \to 0$ in $L^1(P)$ by weak consistency and

$$|P\{X = \hat X_{-t}|X^-\} - \hat P\{X = \hat X_{-t}|X^{-t}\}| \le \sum_{x \in \mathcal{X}} |P\{X = x|X^-\} - \hat P\{X = x|X^{-t}\}|, \qquad (120)$$

we see that

$$[P\{X = \hat X_{-t}|X^-\} - \hat P\{X = \hat X_{-t}|X^{-t}\}] \to 0 \quad \text{in } L^1(P). \qquad (121)$$

It follows from (119) and (121) that

$$P\{X = \hat X_{-t}|X^-\} \to P\{X = \hat X|X^-\} \quad \text{in } L^1(P) \qquad (122)$$

and consequently, by the Lemma,

$$\frac{1}{n}\sum_{0 \le t < n} P\{X_t \ne \hat X_t|X^-X^t\} \to E\{P\{X \ne \hat X|X^-\}\} = P\{X \ne \hat X\} \quad \text{in } L^1(P). \qquad (123)$$

Now observe that

$$\Delta_t = 1\{X_t \ne \hat X_t\} - P\{X_t \ne \hat X_t|X^-X^t\} \qquad (124)$$

is a bounded martingale difference sequence with respect to the σ-fields $\sigma(X^-X^t)$ and
hence

$$\frac{1}{n}\sum_{0 \le t < n} \Delta_t = \frac{1}{n}\sum_{0 \le t < n} [1\{X_t \ne \hat X_t\} - P\{X_t \ne \hat X_t|X^-X^t\}] \to 0 \quad \text{in } L^1(P) \qquad (125)$$

(and also P-almost surely). In fact, the Cesàro means of $\Delta_t$ vanish exponentially fast by
Azuma's [4] exponential inequalities for bounded martingale differences. Addition of (123)
and (125) yields the conclusion (116). ∎

Feder, Merhav and Gutman [12] used the Lempel-Ziv algorithm as a method for
sequential prediction of individual sequences.

C. Problems with Side Information 

A well studied problem in statistical decision theory, pattern recognition and machine
learning is to infer the class label $X_t$ of an item at time t from a covariate or feature vector
$Y_t$ and a training set $X^tY^t = (X_0, Y_0, \ldots, X_{t-1}, Y_{t-1})$. It is often reasonable to assume that
the successive pairs $(X_t, Y_t)$ are independent and identically distributed, but sometimes
defective items tend to come in batches or in periodic runs and in those cases it may be
profitable to exploit dependencies between new items and recent or not so recent items.
Here we assume that the pair process $\{(X_t, Y_t)\}$ is stationary ergodic and we try to exploit
statistical dependencies of arbitrarily long range, although we have no idea what kind of
dependencies to expect a priori. The minimum long run average misclassification rate is
again equal to $P\{X \ne \hat X\}$, but now $\hat X$ is the best predictor of $X = X_0$ given the infinite
past $X^-Y^- = (\ldots, X_{-2}, Y_{-2}, X_{-1}, Y_{-1})$ and the side information $Y = Y_0$:

$$\hat X = \arg\max_{x \in \mathcal{X}} P\{X = x|X^-Y^-Y\}. \qquad (126)$$

The minimum misclassification rate will be asymptotically attained in probability by the
predictors

$$\hat X_t = \arg\max_{x \in \mathcal{X}} \hat P\{X_t = x|X^tY^tY_t\}, \qquad (127)$$

where $\hat P\{X_t = x|X^tY^tY_t\}$ is a shifted version of a conditional probability estimate
$\hat P\{X = x|X^{-t}Y^{-t}Y\}$ such that for any stationary process distribution P on $(\mathcal{X} \times \mathcal{Y})^{\mathbb{Z}}$,

$$\hat P\{X = x|X^{-t}Y^{-t}Y\} \to P\{X = x|X^-Y^-Y\} \quad \text{in } L^1(P). \qquad (128)$$

Such estimates $\hat P\{X = x|X^{-t}Y^{-t}Y\}$ can be constructed by generalizing the methods of
Sections II and III.

In fact, let $\mathcal{X}$ and $\mathcal{Y}$ be complete separable metric spaces and let $\{\mathcal{B}_k\}_{k \ge 1}$ and $\{\mathcal{C}_k\}_{k \ge 1}$
be increasing sequences of finite subfields that asymptotically generate the Borel σ-fields
on $\mathcal{X}$ and $\mathcal{Y}$. We assume that $\mathcal{X}$ is σ-compact and the fields $\mathcal{B}_k$ are constructed as in
the paragraph after Theorem 1B. Let $[x]^k$ and $[y]^k$ denote the atoms of $\mathcal{B}_k$ and $\mathcal{C}_k$ that
contain the points $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, and consider the sequence of past recurrence times
$\tau_j = \tau(k,j)$ of the pattern $[X^{-\ell_k}Y^{-\ell_k}Y]^k$. Then for every stationary process distribution
P on $(\mathcal{X} \times \mathcal{Y})^{\mathbb{Z}}$,

$$\hat P_k(dx|X^{-\lambda_k}Y^{-\lambda_k}Y) = \frac{1}{J_k}\sum_{1 \le j \le J_k} \delta_{X_{-\tau(k,j)}}(dx) \qquad (129)$$

is a weakly consistent estimate of the true conditional distribution $P(dx|X^-Y^-Y)$. Thus
all results in this paper remain valid if the decisions can be made with knowledge of not
only the past but also side information.

Appendix

Let $\{\mathcal{B}_k\}_{k \ge 1}$ be an increasing sequence of finite subfields that asymptotically generate
the Borel σ-field on $\mathcal{X}$, and suppose the empirical conditional distributions $\hat P_k(dx|X^{-\lambda_k})$
are defined as in (16). Let T denote the left shift on the two-sided sequence space $\mathcal{X}^{\mathbb{Z}}$.

Theorem 1A. If $\{X_t\}$ is a stationary process with values in a complete separable metric
(Polish) space $\mathcal{X}$ then for every set B in the generating field $\bigcup_k \mathcal{B}_k$, we have

$$\lim_k \hat P_k\{X \in B|X^{-\lambda_k}\} = P\{X \in B|X^-\} \quad \text{in } L^1. \qquad (130)$$

Proof: It follows from the martingale convergence theorem that

$$\lim_k P\{X \in B\,|\,[X^{-\ell_k}]^k\} = P\{X \in B|X^-\} \qquad (131)$$

almost surely and in $L^1$. Thus it suffices to show that $E|\Theta_k| \to 0$ where

$$\Theta_k = \frac{1}{J_k}\sum_{1 \le j \le J_k} 1\{X_{-\tau(k,j)} \in B\} - P\{X_0 \in B\,|\,[X^{-\ell_k}]^k\}. \qquad (132)$$

We claim that = where 

©fc = T ^ H^f(k,j)eB}-P{XoeB\[X-''']''}. (133) 

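Indeed, for any measurable function $g(\Theta) \ge 0$ (including $g(\Theta) = |\Theta|$) and for any integer
sequence $0 = t_0 < t_1 < \cdots < t_{J_k} = t$, we have

$$1\{\tau_j = t_j,\ 0 \le j \le J_k\}\, g(\Theta_k) = \big[ 1\{\hat\tau_{J_k - j} = t - t_j,\ 0 \le j \le J_k\}\, g(\hat\Theta_k) \big] \circ T^{-t} \tag{134}$$

and consequently, by stationarity,

$$\begin{aligned}
E g(\Theta_k) &= \sum_t \sum_{0 = t_0 < t_1 < \cdots < t_{J_k} = t} E\{ 1\{\tau_j = t_j,\ 0 \le j \le J_k\}\, g(\Theta_k) \} \\
&= \sum_t \sum_{0 = t_0 < t_1 < \cdots < t_{J_k} = t} E\{ 1\{\hat\tau_{J_k - j} = t - t_j,\ 0 \le j \le J_k\}\, g(\hat\Theta_k) \} \\
&= \sum_t \sum_{0 = t_0 < t_1 < \cdots < t_{J_k} = t} E\{ 1\{\hat\tau_i = t_i,\ 0 \le i \le J_k\}\, g(\hat\Theta_k) \} = E g(\hat\Theta_k).
\end{aligned} \tag{135}$$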
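The time reversal behind (134) can be checked on a concrete sequence. The sketch below assumes that the hatted recurrence times are the forward recurrence times of the same pattern; the helper names `past_recurrences` and `future_recurrences` and the toy sequence are hypothetical and chosen only for illustration. Index `pos` plays the role of time 0, so the pattern is the block `seq[pos - block_len:pos]`.

```python
def past_recurrences(seq, pos, block_len, count):
    """Lags t > 0 at which the block ending just before `pos`
    also ends just before `pos - t` (at most `count` of them)."""
    target = seq[pos - block_len:pos]
    taus, t = [], 1
    while len(taus) < count and pos - t - block_len >= 0:
        if seq[pos - t - block_len:pos - t] == target:
            taus.append(t)
        t += 1
    return taus


def future_recurrences(seq, pos, block_len, count):
    """Lags t > 0 at which the block ending just before `pos`
    also ends just before `pos + t` (at most `count` of them)."""
    target = seq[pos - block_len:pos]
    taus, t = [], 1
    while len(taus) < count and pos + t <= len(seq):
        if seq[pos + t - block_len:pos + t] == target:
            taus.append(t)
        t += 1
    return taus


seq = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
J, block_len, origin = 3, 2, len(seq)

tau = [0] + past_recurrences(seq, origin, block_len, J)          # tau_0 = 0
t = tau[J]
tau_hat = [0] + future_recurrences(seq, origin - t, block_len, J)

# Shifting the origin back by t = tau_J turns past lags into forward lags.
assert all(tau_hat[J - j] == t - tau[j] for j in range(J + 1))

# The J sampled outcomes coincide: X_{-tau_j}, 1 <= j <= J, seen from the old
# origin are X_{tau_hat_i}, 0 <= i < J, seen from the shifted origin.
past_samples = [seq[origin - tau[j]] for j in range(1, J + 1)]
forward_samples = [seq[origin - t + tau_hat[i]] for i in range(J)]
assert sorted(past_samples) == sorted(forward_samples)
```

The second assertion shows why the forward average runs over indices $0 \le i < J_k$ while the backward average runs over $1 \le j \le J_k$.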

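Observe that $\{\hat\tau_j - 1\}_{j \ge 0}$ is an increasing sequence of stopping times adapted to the filtration
$\{\mathcal{F}_t\}_{t \ge 0}$ where

$$\mathcal{F}_t = \sigma(\ldots, [X_0]^k, [X_1]^k, \ldots, [X_t]^k). \tag{136}$$

Let $\hat{\mathcal{F}}_j$ denote the $\sigma$-field of events that are expressible in terms of the quantized random
variables $[X_t]^k$ at times $t < \hat\tau_j$. Thus $\hat{\mathcal{F}}_j$ is the $\sigma$-field of events $F$ such that $F \cap \{\hat\tau_j = t\}$
belongs to $\mathcal{F}_{t-1}$ for all $t \ge 0$, and $\hat{\mathcal{F}}_j$ is generated by the family of events $\{F_t \cap \{\hat\tau_j = t\} :
F_t \in \mathcal{F}_{t-1},\ t \ge 0\}$. One may decompose $\hat\Theta_k$ into the sum

$$\hat\Theta_k = \frac{1}{J_k} \sum_{0 \le j < J_k} (\Delta_j + \Phi_j), \tag{137}$$

where

$$\Delta_j = 1\{X_{\hat\tau(k,j)} \in B\} - P\{X_{\hat\tau(k,j)} \in B \mid \hat{\mathcal{F}}_j\}, \tag{138}$$

$$\Phi_j = P\{X_{\hat\tau(k,j)} \in B \mid \hat{\mathcal{F}}_j\} - P\{X_0 \in B \mid [X^{-\ell_k}]^k\}. \tag{139}$$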

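Notice that $\{\Delta_j\}_{j \ge 0}$ is a martingale difference sequence with respect to the filtration
$\{\hat{\mathcal{F}}_j\}_{j \ge 0}$ (in the sense that $\Delta_j$ is $\hat{\mathcal{F}}_{j+1}$-measurable and $E\{\Delta_j \mid \hat{\mathcal{F}}_j\} = 0$ for all $j \ge 0$).
Since $|\Delta_j| \le 1$ and the random variables $\Delta_j$ are orthogonal, we see that

$$E\Big| \frac{1}{J_k} \sum_{0 \le j < J_k} \Delta_j \Big|^2 = \frac{1}{J_k^2} \sum_{0 \le j < J_k} E|\Delta_j|^2 \le \frac{1}{J_k}, \tag{140}$$

and consequently (since $(E|Z|)^2 \le E\{|Z|^2\}$ and $J_k \to \infty$),

$$E\Big| \frac{1}{J_k} \sum_{0 \le j < J_k} \Delta_j \Big| \le \frac{1}{\sqrt{J_k}} \to 0. \tag{141}$$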
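The orthogonality used for the second-moment bound can be spelled out in one more step. The following is a short worked derivation under the stated martingale difference property, writing $\hat{\mathcal{F}}_j$ for the stopped $\sigma$-fields and $\Delta_j$ for the differences:

```latex
\begin{align*}
E\Big|\frac{1}{J_k}\sum_{0\le j<J_k}\Delta_j\Big|^2
  &= \frac{1}{J_k^2}\sum_{0\le j<J_k} E|\Delta_j|^2
   + \frac{2}{J_k^2}\sum_{0\le i<j<J_k} E\{\Delta_i\Delta_j\},\\
E\{\Delta_i\Delta_j\}
  &= E\big\{\Delta_i\,E\{\Delta_j \mid \hat{\mathcal{F}}_j\}\big\} = 0
  \qquad (i<j).
\end{align*}
```

The second line holds because $\Delta_i$ is $\hat{\mathcal{F}}_{i+1}$-measurable and $\hat{\mathcal{F}}_{i+1} \subseteq \hat{\mathcal{F}}_j$ whenever $i < j$, so only the diagonal terms survive, and each of them is at most 1.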



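Also observe that for any measurable function $g(\Phi) \ge 0$ and any integer $t \ge 0$,

$$1\{\hat\tau_j = t\}\, g(\Phi_j) = \big[ 1\{\tau_j = t\}\, g(\Phi_0) \big] \circ T^{t}, \tag{142}$$

and consequently, by stationarity,

$$E g(\Phi_j) = \sum_t E\{ 1\{\hat\tau_j = t\}\, g(\Phi_j) \} = \sum_t E\{ 1\{\tau_j = t\}\, g(\Phi_0) \} = E g(\Phi_0). \tag{143}$$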



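In particular, setting $g(\Phi) = |\Phi|$ proves that $E|\Phi_j| = E|\Phi_0|$. By the martingale convergence
theorem,

$$\Phi_0 = P\{X_0 \in B \mid [X^{-}]^k\} - P\{X_0 \in B \mid [X^{-\ell_k}]^k\} \to 0 \quad \text{almost surely and in } L^1, \tag{144}$$

and consequently

$$E\Big| \frac{1}{J_k} \sum_{0 \le j < J_k} \Phi_j \Big| \le \frac{1}{J_k} \sum_{0 \le j < J_k} E|\Phi_j| = E|\Phi_0| \to 0. \tag{145}$$

The desired conclusion $E|\Theta_k| \to 0$ follows since

$$E|\hat\Theta_k| \le E\Big| \frac{1}{J_k} \sum_{0 \le j < J_k} \Delta_j \Big| + E\Big| \frac{1}{J_k} \sum_{0 \le j < J_k} \Phi_j \Big| \to 0. \tag{146}$$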